To control targeted movements, such as reaching to grasp an object or hammering a nail, 
the brain can use diverse sources of sensory information, such as vision and proprioception. 
Although a variety of studies have shown that sensory signals are optimally combined 
according to principles of maximum likelihood, increasing evidence indicates that the CNS 
does not compute a single, optimal estimation of the target's position to be compared 
with a single optimal estimation of the hand. Rather, it employs a more modular approach 
in which the overall behavior is built by computing multiple concurrent comparisons 
carried out simultaneously in a number of different reference frames. The results of 
these individual comparisons are then optimally combined in order to drive the hand. In 
this article we examine at a computational level two formulations of concurrent models 
for sensory integration and compare them to the more conventional model of converging 
multi-sensory signals. Through a review of published studies, both our own and those 
performed by others, we produce evidence favoring the concurrent formulations. We 
then examine in detail the effects of additive signal noise as information flows through 
the sensorimotor system. By taking into account the noise added by sensorimotor 
transformations, one can explain why the CNS may shift its reliance on one sensory 
modality toward a greater reliance on another and investigate under what conditions 
those sensory transformations occur. Careful consideration of how transformed signals 
will co-vary with the original source also provides insight into how the CNS chooses one 
sensory modality over another. These concepts can be used to explain why the CNS 
might, for instance, create a visual representation of a task that is otherwise limited to 
the kinesthetic domain (e.g., pointing with one hand to a finger on the other) and why the 
CNS might choose to recode sensory information in an external reference frame. 
1. INTRODUCTION 

Reaching to grasp an object requires that the CNS compare the 
position and orientation of the object with the position and ori- 
entation of the hand in order to generate a motor command that 
will bring the hand to the object. Depending on the situation, the 
CNS might use more than one sensory modality, such as vision 
and proprioception, to sense the position and orientation of the 
target and of the hand, with each source of information encoded 
in its own intrinsic reference frame. This raises the question as to 
how the CNS combines these different sources of information to 
generate the appropriate motor commands. 

One school of thought contends that processes of sensor fusion 
for perception can be explained by the tenets of optimal esti- 
mation and control. According to the principles of maximum 
likelihood estimation, sensory signals that contain redundant 
information should be combined based on the expected variabil- 
ity of each so as to maximize the probability of producing a value 
close to the true value of what is being measured. This concept 
has been used with success in recent years to explain how humans 
combine different sources of sensory information to generate 
robust estimates of the position, size and orientation of external 
objects (Landy et al, 1995; Ernst and Banks, 2002; Kersten et al, 
2004; Kording et al, 2007). Of greater interest for us, however, is 
the task of reaching an object with the hand, which adds addi- 
tional aspects to the process beyond that of simple perception. 
The position and orientation of the object and of the hand must 
be effectively subtracted at some level, be it to compute a move- 
ment vector during task planning or to apply corrective actions 
based on real-time feedback during the course of the movement. 
This aspect of the task immediately brings to mind two 
additional issues that must be resolved: (1) To compare the posi- 
tion and orientation of two entities, sensory information about 
each must be expressed in a common coordinate frame. What reference 
frame(s) are used to perform the requisite computations? 
(2) The fusion of redundant sensory information might occur at 
various stages in the perception-action cycle. Where and how are 
the principles of maximum likelihood applied? In this article we 
will contrast two possible models of sensor fusion, which we will 
call convergent and concurrent, as illustrated in Figure 1 for the 
task of hitting a nail with a hammer. 

The convergent model shown in Figure 1A reflects the conven- 
tional idea that the CNS constructs a single representation of the 
FIGURE 1 | Convergent vs. concurrent models of sensorimotor 
integration. The two conceptual models are applied to a case in which 
visual (V) and kinesthetic (K) information can be used to estimate the 
positions of both the target (T) and the hand (H) that holds the hammer. Red 
and blue colors correspond to information encoded in retinal and body-centered 
reference frames, respectively. (A) In the convergent model the 
visual and kinesthetic information about the target ($x_{T,V}$ and $x_{T,K}$, 
respectively) are optimally combined to build a multimodal estimate of its 
position ($x_T$). Ditto for the hand/hammer position ($x_H$). The two optimal 
estimates are then compared (subtracted) to compute the movement 
vector $\Delta x$. (B) In the concurrent approach the positions of the target and 
the hand/hammer are compared simultaneously in visual and kinesthetic 
space. The two resulting unimodal movement vectors $\Delta x_V$ and $\Delta x_K$ are 
then optimally combined to compute the multimodal movement vector $\Delta x$. 



target based on all available sensory information. In the example 
of hammering a nail, this includes the position of the nail-head in 
the visual field and the position of the fingertips holding the nail 
as sensed by kinesthesia. Weighting can be used to privilege either 
the visual or the kinesthetic information in the estimate of the tar- 
get position; ditto for the estimation of the hammer's position and 
orientation, for which both visual and kinesthetic information 
are available. The combined representations are then compared in 
some reference frame that could be the reference frame intrinsic 
to one of the sensory modalities, or it could be some other, more 
generalized coordinate system. For instance, kinesthetic infor- 
mation could be transformed into retinal coordinates, or both 
visual and kinesthetic information could be transformed into a 
common reference frame centered on the head or on the trunk 
or referenced to external objects (McIntyre et al, 1997; Guerraz 
et al, 1998; Henriques et al, 1998; McIntyre et al, 1998; Carrozzo 
et al, 1999; Pouget et al, 2002a; Avillac et al, 2005; Obhi and 
Goodale, 2005; Byrne et al, 2010). Under this scheme, the CNS 
would combine all available sensory information about the target 
into a single, optimal representation of its position and orienta- 
tion. Similarly, sensory information would be combined to form 
an optimal representation of the hand's position and orientation 
in the same general reference frame. The comparison of target 
and hand would then be carried out within this general reference 
frame and the difference between the two positions would be used 
to drive the motor response. 



Figure 1B shows the alternative hypothesis by which the CNS 
performs a distributed set of concurrent comparisons within each 
reference frame first, and then combines the results to form a 
unique movement vector (Tagliabue and McIntyre, 2008, 2011, 
2012, 2013; McGuire and Sabes, 2009, 2011; Tagliabue et al, 
2013). In the example of hammering the nail, visual information 
about the nail-head is compared to visual information about the 
hammer while at the same time kinesthetic information about the 
hand holding the nail is compared with kinesthetic information 
about the hand swinging the hammer. Each comparison is carried 
out separately and thus may be carried out within the coordinate 
system intrinsic to the corresponding sensory modality. Under 
this formulation, a movement is programmed based on an opti- 
mal combination of the different movement vectors within each 
of the various reference frames. In this way the CNS accomplishes 
multimodal sensorimotor coordination in a modular fashion by 
performing a number of simpler target-hand comparisons in 
parallel. 

The purpose of this article is to examine in greater detail these 
two hypotheses of convergent versus concurrent comparisons of 
target and hand for reaching movements, both at a theoretical 
level and through a targeted review of the pertinent literature. 
In section 2 we differentiate further the two models at the con- 
ceptual level by showing mathematically how the application of 
optimal estimation differs between them. Using these equations, 
we go on to present the experimental evidence supporting the 
hypothesis that the CNS functions according to the concurrent 
model. In section 3 we examine the conditions in which the CNS 
will transform information from the intrinsic reference frame 
of one sensor to the reference frame of another. Key to this 
discussion is an assessment of how coordinate transformations 
and memory processes affect the variability of the outcome, and 
we explicitly take into account how co-variation of transformed 
signals affects the choice of weighting. Section 4 examines the 
time course of the underlying sensorimotor processes, providing 
insight into when sensorimotor transformations are actually per- 
formed and, as a corollary, indicating that not only does the CNS 
perform multiple comparisons in parallel, it maintains parallel 
memory traces in multiple reference frames as well. In section 
5 we generalize the concepts of convergent and concurrent pro- 
cesses to more than two sensory modalities, and in section 6 we 
use these formulations to consider trade-offs between using sen- 
sory information encoded in reference frames intrinsic to the 
sensors themselves or with respect to extrinsic reference frames 
such as the visual surround or with respect to gravity. In the final 
section we describe some specific predictions made by differ- 
ent concurrent and convergent formulations and discuss how the 
models might be differentiated experimentally. 

2. MULTIPLE, CONCURRENT vs. MULTIMODAL, 
CONVERGENT 

The two models depicted in Figure 1 can be described mathemat- 
ically, in the linear case, as a set of weighted sums and differences. 
We use here linear formulations because they simplify the equa- 
tions and are sufficient to make predictions about how the two 
models might differ computationally and experimentally. The 
main feature of the convergent model in Figure 1A is that a single 
representation of the target is compared to a single representation 
of the hand in the common reference frame, and a movement is 
performed that reduces the difference of these two estimates, $\Delta x$, 
to zero. The equation describing this formulation is: 

$$\Delta x = (w_{T,V}\, x_{T,V} + w_{T,K}\, x_{T,K}) - (w_{H,V}\, x_{H,V} + w_{H,K}\, x_{H,K}) \tag{1}$$

where $x_{T,V}$ and $x_{T,K}$ represent the position of the target detected 
by vision and kinesthesia, respectively, $x_{H,V}$ and $x_{H,K}$ represent 
the detected position of the hammer in each of those reference 
frames, and $w_{T,V}$, $w_{T,K}$, $w_{H,V}$ and $w_{H,K}$ are the weights given to 
each of these pieces of information. In the concurrent model of 
Figure 1B, target and hand are compared in the reference frame 
of each sensory modality first, and then the final movement vector 
$\Delta x$ is computed as a weighted sum of the individual differences. 
This process can be described by the equation: 

$$\Delta x = \lambda_V (x_{T,V} - x_{H,V}) + \lambda_K (x_{T,K} - x_{H,K}) \tag{2}$$

where $\lambda_V$ and $\lambda_K$ represent the weight given to the comparisons 
carried out in each of the two sensory modalities. Common to 
both Equations (1 and 2) is the idea that redundant information 
from the various sensory modalities can be weighted differently 
through the factors $w$ and $\lambda$. In fact, Equation (2) is a special case 
of Equation (1), with the added constraint that within each sensory 
modality, signals about the target and the hand must have 
the same weight: 



$$w_{T,V} = w_{H,V} = \lambda_V \quad \text{and} \quad w_{T,K} = w_{H,K} = \lambda_K \tag{3}$$



In the linear formulation used here, therefore, the computational 
difference between the two models is not so much in terms of the 
order in which sensory information is added or subtracted, but 
rather in terms of how the weighting factors $w$ and $\lambda$ are chosen. 
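
The algebraic relationship between Equations (1) and (2) can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: the function names, the positions and the weights are all hypothetical values chosen only for illustration. It shows that the two formulations give the same movement vector when the target and hand weights are tied within each modality, and a different one when they are not.

```python
# Illustrative sketch only: names and numbers are assumptions, not data.

def convergent_dx(x_t, x_h, w_t, w_h):
    """Equation (1): combine target cues, combine hand cues, then subtract."""
    target = w_t["V"] * x_t["V"] + w_t["K"] * x_t["K"]
    hand = w_h["V"] * x_h["V"] + w_h["K"] * x_h["K"]
    return target - hand


def concurrent_dx(x_t, x_h, lam):
    """Equation (2): subtract within each modality, then combine."""
    return sum(lam[m] * (x_t[m] - x_h[m]) for m in ("V", "K"))


# Hypothetical one-dimensional positions sensed by vision (V) and kinesthesia (K).
x_t = {"V": 10.0, "K": 11.0}
x_h = {"V": 2.0, "K": 1.5}
lam = {"V": 0.7, "K": 0.3}

dx_concurrent = concurrent_dx(x_t, x_h, lam)
# Same weights for target and hand within each modality: Equation (1) reduces to (2).
dx_tied = convergent_dx(x_t, x_h, lam, lam)
# Different weights for target and hand: the two formulations diverge.
dx_free = convergent_dx(x_t, x_h, {"V": 0.7, "K": 0.3}, {"V": 0.4, "K": 0.6})

print(dx_concurrent, dx_tied, dx_free)
```

The only degree of freedom that separates the two models in this linear setting is therefore whether the target and hand weights are allowed to differ within a modality.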

The principles of maximum likelihood estimation (MLE) can 
be applied to both Equations (1 and 2) to find weighting factors 
that are in some sense optimal, although they differ in terms of 
what is optimized. The optimal estimation of a parameter $p$ given 
noisy measurements $(m_1, \ldots, m_n)$ corresponds to the value that 
maximizes the probability distribution $P(m_1, \ldots, m_n \mid p)$ which 
for independent measurements is equal to 
$P(m_1, \ldots, m_n \mid p) = \prod_{i=1}^{n} P(m_i \mid p)$. If each measurement is considered to be governed 
by Gaussian noise, the optimal estimate is analytically derived to 
be the weighted average such that the relative weight given to any 
one of the component quantities is equal to the inverse of its 
variance relative to all the other quantities: 



$$w_i = \frac{\sigma_{m_i}^{-2}}{\sum_{j=1}^{n} \sigma_{m_j}^{-2}} \tag{4}$$

where $\sigma^2_{m_i}$ is the variance of measurement $m_i$. Thus, noisy variables 
are given less weight compared to those that are more 
reliable (Ghahramani et al., 1997). If weighted in this manner, the 
linear combination of different sources of information results in 
a reduction of output variability (i.e., an increase in movement 
precision) compared to the use of any one source of informa- 
tion alone. For illustration purposes, therefore, we assume that 



the noise exhibited by each sensory signal is Gaussian so that we 
may apply the linear maximal likelihood solution (Equation 4) to 
find the optimal weights. 
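
The inverse-variance weighting described above can be illustrated with a short Python sketch. The function names and the two example variances are assumptions made for this illustration; the sketch only assumes independent Gaussian noise, as stated in the text.

```python
# Illustrative sketch only: the variances below are made-up numbers.

def mle_weights(variances):
    """Weights proportional to the inverse variance of each measurement."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [i / total for i in inverse]


def combined_variance(variances):
    """Variance of the inverse-variance weighted average of independent signals."""
    return 1.0 / sum(1.0 / v for v in variances)


variances = [1.0, 4.0]          # a reliable signal and a noisier one
weights = mle_weights(variances)
var_combined = combined_variance(variances)

# Direct check: variance of sum(w_i * m_i) for independent m_i.
var_check = sum(w * w * v for w, v in zip(weights, variances))

print(weights, var_combined, var_check)
```

With these numbers the more reliable signal receives four times the weight of the noisier one, and the variance of the combination is lower than that of either signal alone.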

For the convergent model in Figure 1A, applying MLE in order 
to compute the weighting factors (w's) in Equation (1) means that 
an optimal estimate of the position of the hand, derived from 
all available sensory feedback about the hand, will be compared 
to (subtracted from) an optimal estimate of the target's position, 
similarly derived from all available sources of sensory information 
about the target. Applying Equation (4) to the convergent model, 
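the sets of weights for $i = K$ and $i = V$ are: 

$$w_{T,i} = \frac{\sigma_{T,i}^{-2}}{\sigma_{T,V}^{-2} + \sigma_{T,K}^{-2}} \quad \text{and} \quad w_{H,i} = \frac{\sigma_{H,i}^{-2}}{\sigma_{H,V}^{-2} + \sigma_{H,K}^{-2}} \tag{5}$$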



The computation of weighting factors ($\lambda_i$'s) for the parallel structure 
in Figure 1B is somewhat different. Here, target and hand 
are compared in both sensory modalities in parallel ($\Delta x_i = x_{T,i} - x_{H,i}$) 
and maximum likelihood then determines how much 
weight should be given to each of these comparisons, based on the 
expected variance of each of the computed differences. Given that 
the variance of a difference is simply the sum of the variances of its 
minuend and of its subtrahend ($\sigma^2_{\Delta i} = \sigma^2_{T,i} + \sigma^2_{H,i}$) and applying 
Equation (4), the weight given to each difference is computed as: 

$$\lambda_i = \frac{\sigma_{\Delta i}^{-2}}{\sigma_{\Delta V}^{-2} + \sigma_{\Delta K}^{-2}} \tag{6}$$



Conceptually, therefore, the convergent and concurrent models 
differ primarily in terms of what is optimized. For the conver- 
gent model, an optimal estimate of the target and an optimal 
estimate of the hand are computed and then used to compute a 
movement vector. Under the concurrent model, multiple move- 
ment vectors are computed and then these vectors combined in 
an optimal fashion. Thus, even though Equations (1 and 2) are 
algebraically very similar, the choice of what to optimize when 
determining the various weights leads to different results for the two 
different models. Note that the neural system may not operate 
in a strictly linear fashion, in which case differentiating between 
the two model structures would be even more important in terms 
of model predictions. But even the linear analysis presented here 
allows one to draw a distinction between the convergent and con- 
current models, both conceptually, as we have described here, and 
experimentally, as we will show in the following paragraphs. 
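
To make the distinction in what is optimized concrete, the following Python sketch computes the weights for both models in a full-feedback (VK-VK) situation. All variances are hypothetical and the helper name is an assumption for this example. In the convergent case inverse-variance weights are computed separately for the target and for the hand; in the concurrent case they are computed for the two target-hand differences, whose variances are taken as the sum of the target and hand variances in each modality.

```python
# Illustrative sketch only: variances are made-up numbers.

def inverse_variance_weights(variances):
    """Normalized inverse-variance weights for a dict of variances."""
    inverse = {k: 1.0 / v for k, v in variances.items()}
    total = sum(inverse.values())
    return {k: v / total for k, v in inverse.items()}


var_target = {"V": 1.0, "K": 4.0}   # target seen well, felt poorly
var_hand = {"V": 2.0, "K": 1.0}     # hand felt better than seen

# Convergent model: one optimal estimate per object.
w_target = inverse_variance_weights(var_target)
w_hand = inverse_variance_weights(var_hand)

# Concurrent model: one optimal combination of per-modality differences.
var_delta = {m: var_target[m] + var_hand[m] for m in ("V", "K")}
lam = inverse_variance_weights(var_delta)

print(w_target, w_hand, lam)
```

Here the convergent model assigns the visual signal a weight of 0.8 for the target but only 1/3 for the hand, whereas the concurrent model assigns a single visual weight of 0.625 to the whole visual comparison.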

2.1. DISTINGUISHING BETWEEN MODELS 

When both target and hand can be localized via all the same sen- 
sory modalities, the convergent and modular formulations differ 
very little in terms of the predicted outcomes. In the example of 
hitting a nail with a hammer, this corresponds to the situation in 
which one can simultaneously see and feel with the hand both 
the hammer and the nail. In these circumstances, both models 
predict that more weight will be given to the most reliable (e.g., 
the least noisy) sensory channels. However, when only a subset of 
sensory information is available (e.g., only vision of the target or 
only kinesthesia about the hand), the two different formulations 
predict two substantially different outcomes. 
Consider the situation of a nail that is already embedded in 
the wall, such that it need not be held by the non-dominant 
hand (Figure 2A). Information about the target would therefore 
be limited to the visual domain. Compare this to hammering a 
nail that is held by the non-dominant hand, but whose head is 
obscured from view (Figure 2B). This example is perhaps not a 
very wise thing to do in real life, but it illustrates the point. To 
generalize, we will refer to these two types of tasks by the nota- 
tion V-VK (visual target, visual and kinesthetic hand feedback) 
and K-VK, respectively, and to the original case of hammering a 
hand-held nail with full vision of both target and hands as a VK- 
VK task. In the case of the convergent model (Figures 2C,D), the 
lack of one source of information about the target simply means 
that an optimal combination of the remaining sensory cues will 
be used to localize the target. Thus, in V-VK, a representation 
of the target based on visual cues, transformed into the com- 
mon reference frame, will be compared with a representation of 
the hand in that same reference frame derived from both visual 
and kinesthetic feedback. Similarly, in K-VK a representation of 
the target derived from kinesthetic information will be compared 



with a representation of the hand that is based on an optimal 
combination of visual and kinesthetic cues. 

Applying the concurrent scheme to the situations shown in 
Figures 2A,B, however, raises the question: What is to be done 
with kinesthetic information about the hand when the target 
is presented only visually (V-VK) and what is done with visual 
information about the hand when the target is localized only 
kinesthetically (K-VK)? One possibility (not shown) is that the 
CNS simply ignores information about the hand in any sen- 
sory modality that is not also used to localize the target, relying 
only on sensory information that is directly comparable. Thus, 
only visual information about the hand would be used in the 
V-VK situation and only kinesthetic information about the hand 
would be used in the K-VK situation. But by doing so, one 
would forfeit the added precision that could be obtained by using 
both sources of sensory information about the hand holding the 
hammer. Alternatively, as illustrated in Figures 2E,F, the CNS 
could reconstruct the missing sensory information about the tar- 
get by performing a cross-modal sensory transformation (green 
arrows). According to this arrangement, a kinesthetic representa- 
tion of the target will be derived from visual information in V-VK, 
allowing both the visual and the kinesthetic information from the 
hand to be utilized. Analogously, the target can be reconstructed 
in visual space in K-VK, again allowing the comparison of target 
and hand to be carried out in both the visual and the kinesthetic 
domains. 

The difference between the convergent and concurrent formu- 
lations becomes apparent if one compares the model predictions 
for V-VK versus K-VK in terms of the relative weighting given 
to visual or kinesthetic modalities. Consider first the concurrent 
models in Figures 2E,F. When computing the optimal weights $\lambda_V$ 
and $\lambda_K$ one must take into account not only the noise intrinsic to 
the sensory inputs, but also the noise added by cross-modal transformations 
(Soechting and Flanders, 1989; Tillery et al., 1991; 
Schlicht and Schrater, 2007) when a sensory input missing in one 
modality must be reconstructed from sensory signals in the other. 
Taking into account this additional noise when applying Equation 
(6), one obtains for K-VK: 



FIGURE 2 | Incomplete sensory information. The computational structures 
of the convergent and concurrent models are compared for situations in 
which the target position can be sensed through (A, V-VK task) visual ($x_{T,V}$) or 
(B, K-VK task) kinesthetic ($x_{T,K}$) information only, whilst information from both sensory 
modalities ($x_{H,V}$ and $x_{H,K}$) can be used to estimate the effector/hand position. 
Panels (C,D) represent how available sensory signals would be used 
following the Convergent Model in each of the two situations, respectively. 
Panels (E,F) illustrate the computational structure of the Concurrent Model 
for the same two situations. Green arrows represent the cross-modal 
sensory transformations that might be performed. Grayed out symbols 
indicate sensory inputs that are absent, as compared to the situation 
shown in Figure 1. All other notations and color conventions are the same 
as in Figure 1. 



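$$\lambda_V = \frac{\sigma^2_{T,K} + \sigma^2_{H,K}}{(\sigma^2_{T,K} + \sigma^2_{H,K}) + (\sigma^2_{T,K} + \sigma^2_{T,K \mapsto V} + \sigma^2_{H,V})}$$

$$\lambda_K = \frac{\sigma^2_{T,K} + \sigma^2_{T,K \mapsto V} + \sigma^2_{H,V}}{(\sigma^2_{T,K} + \sigma^2_{H,K}) + (\sigma^2_{T,K} + \sigma^2_{T,K \mapsto V} + \sigma^2_{H,V})} \tag{7}$$

and for V-VK: 

$$\lambda_V = \frac{\sigma^2_{T,V} + \sigma^2_{T,V \mapsto K} + \sigma^2_{H,K}}{(\sigma^2_{T,V} + \sigma^2_{T,V \mapsto K} + \sigma^2_{H,K}) + (\sigma^2_{T,V} + \sigma^2_{H,V})}$$

$$\lambda_K = \frac{\sigma^2_{T,V} + \sigma^2_{H,V}}{(\sigma^2_{T,V} + \sigma^2_{T,V \mapsto K} + \sigma^2_{H,K}) + (\sigma^2_{T,V} + \sigma^2_{H,V})} \tag{8}$$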
where $\sigma^2_{T,K \mapsto V}$ and $\sigma^2_{T,V \mapsto K}$ represent the noise added when reconstructing 
a visual representation of the target from kinesthetic 
information and the noise added when reconstructing the target 
in kinesthetic space from visual information, respectively. One 
can see from these sets of equations that changing what sen- 
sory information is available about the target has the potential of 
changing the weight given to each type of sensory feedback used 
to guide the hand. Indeed, less weight (smaller $\lambda_i$'s) will be given 
to the component comparisons that require the reconstruction of 
sensory information, due to the noise that these reconstructions 
add to the signals. In most cases, therefore, the weighting of the 
two component comparisons will shift toward the visual information 
when the target is visual (V-VK) and will shift toward 
the kinesthetic domain when the target is kinesthetic (K-VK). 
In the limit, if the transformation noise is very high compared 
to the input noise, the comparison that requires a sensorimotor 
reconstruction will be given zero weight, leaving only the direct 
comparison to drive the response. 
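
The shift in weighting predicted by the concurrent model can be sketched in a few lines of Python. The function name and all variances are hypothetical. The sketch assumes, as in the text, that the direct comparison has variance equal to the target variance plus the hand variance in the target's own modality, and that the reconstructed comparison has variance equal to the target variance plus the transformation noise plus the hand variance in the other modality.

```python
# Illustrative sketch only: variances are made-up numbers.

def concurrent_weights(target_modality, var_target, var_hand, var_transform):
    """Weights of the visual (V) and kinesthetic (K) comparisons when the
    target is sensed in a single modality and reconstructed in the other."""
    other = "K" if target_modality == "V" else "V"
    var_direct = var_target + var_hand[target_modality]
    var_reconstructed = var_target + var_transform + var_hand[other]
    total = var_direct + var_reconstructed
    # Each comparison is weighted by the relative variance of the other one.
    return {target_modality: var_reconstructed / total,
            other: var_direct / total}


var_hand = {"V": 1.0, "K": 1.0}

v_vk = concurrent_weights("V", 1.0, var_hand, var_transform=2.0)
k_vk = concurrent_weights("K", 1.0, var_hand, var_transform=2.0)

# Limit case: very noisy transformation, reconstructed comparison is ignored.
v_vk_limit = concurrent_weights("V", 1.0, var_hand, var_transform=1e9)

print(v_vk, k_vk, v_vk_limit)
```

With equal sensory noise in all channels, the weight of the visual comparison is 2/3 for the visual target and 1/3 for the kinesthetic target, so the same hand feedback is weighted differently depending only on the target modality.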

For the convergent model, there is no inherent need to recon- 
struct sensory information that is not available. The CNS would 
simply use all the available sensory information about the target 
and all available sensory information about the hand in order 
to compute an optimal estimate of the position of each. This 
does not mean, however, that no sensorimotor transformations 
are required to implement the convergent formulation. On the 
contrary, in order to combine spatial information from different 
sources, the different pieces of information must be expressed in 
a common reference frame R. Thus, for the convergent model, 
coordinate transformations will be required even though no 
"reconstruction" of missing sensory information is needed. These 
transformations will also add noise which will affect the weight- 
ing between the different inputs and should therefore be explicitly 
considered when comparing the concurrent and convergent mod- 
els. According to Equations (1 and 5), the estimate of the hand's 
position and orientation will be based on a weighted sum of the 
visual and kinesthetic feedback, with the weight determined by 
the variance of the two feedback signals and by the noise added 
by the two sensorimotor transformations: 

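$$w_{H,V} = \frac{\sigma^2_{H,K} + \sigma^2_{H,K \mapsto R}}{\sigma^2_{H,K} + \sigma^2_{H,K \mapsto R} + \sigma^2_{H,V} + \sigma^2_{H,V \mapsto R}}$$

$$w_{H,K} = \frac{\sigma^2_{H,V} + \sigma^2_{H,V \mapsto R}}{\sigma^2_{H,K} + \sigma^2_{H,K \mapsto R} + \sigma^2_{H,V} + \sigma^2_{H,V \mapsto R}} \tag{9}$$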

One can see that even if one considers noise added by sensorimo- 
tor transformations, the convergent model, unlike the concurrent 
model, predicts that the weighting of sensory information will 
not change between V-VK and K-VK. Because the information 
available about the hand is the same in both V-VK and K-VK, the 
relative weight given to visual versus kinesthetic feedback about 
the hand will be the same in both circumstances, regardless of the 
sensory modality used to sense the target. 
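
The invariance predicted by the convergent model can be made explicit with a small Python sketch. The function name and the variances are hypothetical. The sketch assumes that each hand signal carries its own sensory variance plus the variance added by its transformation into the common reference frame R, and that the two resulting signals are weighted by inverse variance.

```python
# Illustrative sketch only: variances are made-up numbers.

def convergent_hand_weights(var_hv, var_hk, var_v_to_r, var_k_to_r):
    """Weights of visual and kinesthetic hand feedback in the common frame R.
    Note that no property of the target enters this computation."""
    noisy_v = var_hv + var_v_to_r   # visual hand signal expressed in R
    noisy_k = var_hk + var_k_to_r   # kinesthetic hand signal expressed in R
    total = noisy_v + noisy_k
    return {"V": noisy_k / total, "K": noisy_v / total}


# The same hand feedback is available in both tasks, so the same call applies.
weights_by_task = {
    task: convergent_hand_weights(1.0, 2.0, 0.5, 1.0)
    for task in ("V-VK", "K-VK")
}

print(weights_by_task)
```

Because the target modality is not an input to the hand weights, the two tasks yield identical values, in contrast to the concurrent formulation.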

The convergent and concurrent models make two different 
predictions, therefore, about what happens when the modality 
of the target is changed while full feedback of the hand is avail- 
able. These predictions allow one to differentiate between the two 



hypotheses experimentally. Indeed, a number of studies that have 
compared moving the hand to visual versus proprioceptive targets 
provide support for the hypothesis of concurrent comparisons 
shown in Figure 1B. For instance: 

• Sober and Sabes (2005) compared pointing to a visual target 
versus pointing with one hand to the unseen index finger 
of the other. They used virtual reality to introduce conflict 
between visual and proprioceptive feedback about the initial 
position of the pointing finger. By measuring the bias toward 
the visual or the proprioceptive feedback about the position 
of the finger, they found a significant difference in the relative 
weighting of visual and kinesthetic hand feedback depending 
on the modality of the target. 

• Sarlegna and Sainburg (2007) also used a virtual-reality technique 
to dissociate visual and proprioceptive feedback about 
the hand's initial position. The choice of target modality (mov- 
ing to a visual target versus moving to the position of the other, 
unseen hand) had a significant effect on the contribution of 
vision versus proprioception to the control of the amplitude of 
rapid reaching movements. 

• McGuire and Sabes (2009) made use of the well-known reti- 
nal eccentricity effect (Bock, 1986) and imposed changes in gaze 
direction to measure the reliance on visual versus kinesthetic 
information. They found that when visual and kinesthetic 
information about the hand was available, deviations due to 
changes in gaze direction, which would indicate coding of the 
movement in retinal space, depended on the target modality 
(pointing to a visual target versus pointing to the unseen left 
hand). 

• Tagliabue and McIntyre (2011) asked subjects to align the 
hand with a target in the fronto-parallel plane. They used a 
virtual reality technique to introduce conflict between visual 
and kinesthetic reference frames during a memory delay. In 
these experiments the sensory modality used to present the 
target orientation had a significant effect on the weight given 
to visual versus kinesthetic comparisons when driving the 
response, with a shift toward visual information when the tar- 
get was visual and kinesthetic information when the target was 
kinesthetic. 

Because their data could not be reconciled with the encoding of 
movement parameters exclusively in either retinotopic space or 
kinesthetic space, the authors of the last two studies each proposed 
versions of the concurrent structure depicted in Figure 1B. 
The specifics of the models proposed by these different authors 
differ slightly from each other (more on the similarities and 
differences below) but both involve multiple comparisons in mul- 
tiple reference frames and both can explain a shift in weighting 
toward visual information when the target was visible and toward 
kinesthetic information when the target was kinesthetic. Thus, 
compared to the hypothesis of convergent, multi-modal sensory 
integration shown in Figure 1A, the computational structure of 
multiple, concurrent comparisons depicted in Figure 1B provides 
a much more parsimonious explanation of the data reported from 
a number of different tasks and experimental paradigms. 
3. TO RECONSTRUCT OR NOT TO RECONSTRUCT? 

Inherent to the concurrent model is the concept of sensory recon- 
struction. According to this idea, a visible target could be com- 
pared with proprioceptive information about the location of the 
hand if the visible information is transformed into propriocep- 
tive space. Some such reconstruction would be necessary when, 
for instance, reaching toward a visual target with the unseen hand 
(V-K). The question remains, however, as to whether the visual 
target should be transformed into kinesthetic space or whether 
a visual representation of the hand should be constructed based 
on proprioceptive information from the arm. Transforming target 
information into kinesthetic space would be optional in a V-VK 
situation, where a direct comparison of target and hand could be 
carried out in visual coordinates. It would be even more super- 
fluous to transform into visual space a purely kinesthetic (K-K) 
task. Yet the implication of visual representations in purely kines- 
thetic tasks is known to occur (Pouget et al, 2002b; Sober and 
Sabes, 2005; Sarlegna and Sainburg, 2007; McGuire and Sabes, 
2009; Jones and Henriques, 2010). A key question to be addressed, 
therefore, is that of how the CNS chooses which comparisons to 
apply to a given task, and how to weight the different compu- 
tations to arrive at the overall response. Under what conditions 
should information from one sensory modality be transformed 
into the reference frame of another? 

In our original publication (Tagliabue and McIntyre, 2011) 
we argued that the CNS avoids sensory transformations, and 
thus performs direct comparisons whenever possible. Indeed, we 
observed that a V-VK task was carried out in visual coordinates 
while the equivalent K-VK task was carried out in kinesthetic 
space. (Note that we observed this result when subjects held their 
head upright. We saw a somewhat different result when subjects 
were asked to move their head during an imposed memory delay. 
We will discuss these latter results further down in this section). In 
our V-K and our K-V tasks, however, we observed that both visual 
and kinesthetic comparisons were performed, even though just 
one of these (and just one transformation) would have been suffi- 
cient. For instance, in V-K, subjects could have performed a single 
transformation of visual information into kinesthetic space, or 
they could have only transformed the kinesthetic hand informa- 
tion so as to perform the task in visual space. The fact that both 
transformations and both comparisons were performed shows 
that the CNS does sometimes perform "unnecessary" transfor- 
mations beyond what would be minimally necessary to achieve 
the task. 

In order to explain our results, and others, we had to resort to 
additional, albeit reasonable, assumptions that went beyond the 
basic tenets of MLE. The first was that direct comparisons are 
absolutely best, even though estimates of noise in the visual and 
kinesthetic channels and the conventional application of maxi- 
mum likelihood would predict a more graded weighting between 
visual and kinesthetic information for the V-V and K-K tasks. The 
second was that the necessity of a single transformation would 
provoke the execution of a whole range of transformations into 
a number of different reference frames or sensory modalities. This
could explain why the CNS would reconstruct a visual representa- 
tion of a task that is otherwise purely kinesthetic, as was observed 
in the studies mentioned above. In the discussion of our results,
we argued that this could be because a common neural network
might generate the same amount of noise, whether performing 
one or many transformations. While this is a reasonable, and even 
testable, hypothesis, it still remains unproven and thus still con- 
stitutes, as of this writing, an ad hoc assumption that we had to 
invoke in order to reconcile empirical data with MLE. 

In a more recent study, however, we showed how MLE can
explain much, if not all, of the available data without these additional
assumptions, if one properly accounts for co-variation of
noise in sensory signals that have been reconstructed in one sensory
modality from another (Tagliabue and McIntyre, 2013). The
issue of co-variation is important because it conditions how two
signals should be optimally weighted. If two signals are stochastically
independent, the principle of maximum likelihood estimation
says that the two quantities should be weighted according to the
inverse of their respective expected variances. This weighted average
will tend to reduce the effects of the independent noise in each
component. But if the noise in one is correlated with the noise in
the other, computing the weighted average will be less effective in
reducing the overall noise. In the limit, if the noise in the two variables
is perfectly correlated, then computing the weighted average
will not reduce the overall noise at all.

To correctly compensate for covariance between two signals in 
the computation of the optimal weights to be applied, one must 
essentially take into account only the independent components of 
noise within each variable. In the case of two non-independent 
variables that exhibit Gaussian noise, the weighted combination 
of x and y that will minimize the variance of the output: 

$$z = \lambda x + (1 - \lambda) y \tag{10}$$

is given by the equation:

$$\lambda = \frac{\left(\sigma_x^2 - \mathrm{cov}_{x,y}\right)^{-1}}{\left(\sigma_x^2 - \mathrm{cov}_{x,y}\right)^{-1} + \left(\sigma_y^2 - \mathrm{cov}_{x,y}\right)^{-1}} \tag{11}$$

where $\mathrm{cov}_{x,y}$ is the covariance between x and y. Added insight
can be achieved if one considers that the two components x and y are
derived from two stochastically independent signals, p and q, and
a common component c:

$$\begin{aligned}
x &= p + c \\
y &= q + c
\end{aligned} \tag{12}$$

In this case, which is directly applicable to the sensorimotor transformations
that are being considered in this paper, the covariance
between x and y is precisely equal to the variance of the common
component c:

$$\begin{aligned}
\sigma_x^2 &= \sigma_p^2 + \sigma_c^2 \\
\sigma_y^2 &= \sigma_q^2 + \sigma_c^2 \\
\mathrm{cov}_{x,y} &= \sigma_c^2
\end{aligned} \tag{13}$$

FIGURE 3 | Direct vs. indirect comparisons (modified from Tagliabue
and McIntyre, 2013). The schematics represent the concurrent model
applied to two tasks that are both purely kinesthetic (K-K). In the
INTRA-manual task the subject feels the target position with the right hand
($T,K_R$) and reproduces it with the same hand ($H,K_R$). In the INTER-manual
task the target is felt with the left hand ($T,K_L$) and its position is reproduced
with the right ($H,K_R$). As in Figures 1, 2, red and blue arrows represent
visual and kinesthetic signals, respectively, circular nodes represent
movement vectors computed in different reference frames and green
arrows represent sensory transformations. Each task can potentially be
carried out partially in visual space by reconstructing a visual representation
of the target ($T,V$) and a visual representation of the hand ($H,V$) from
available kinesthetic inputs. In the INTRA-, but not the INTER-manual task, a
direct comparison between the kinesthetic signals about target and
response is possible. Taking into account co-variance between
reconstructed signals, only in the INTER-manual condition would a reconstruction
of an "unnecessary" visual representation reduce movement variability.
Grayed-out symbols represent sensory inputs that are absent in each task
while grayed-out green arrows depict sensory reconstructions that are
given no weight when MLE is applied.



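and Equation (11) reduces to:

$$\lambda = \frac{\left(\sigma_p^2\right)^{-1}}{\left(\sigma_p^2\right)^{-1} + \left(\sigma_q^2\right)^{-1}} = \frac{\sigma_q^2}{\sigma_p^2 + \sigma_q^2} \tag{14}$$

In other words, the optimal weighting of x and y depends only on
the variance of the independent components p and q.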
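
The effect of shared noise on the optimal weight can be checked numerically. The following is a minimal sketch, not part of the original study: the function names `optimal_weight` and `combined_variance` and all numerical variances are illustrative assumptions. It evaluates Equation (11) in an algebraically equivalent form for the case x = p + c, y = q + c, and compares the result with a weight that ignores the covariance.

```python
def optimal_weight(var_x, var_y, cov_xy):
    """Weight on x that minimizes the variance of z = w*x + (1 - w)*y.

    Equivalent to Equation (11): with A = var_x - cov_xy and
    B = var_y - cov_xy, (1/A) / (1/A + 1/B) = B / (A + B).
    Undefined when A + B == 0 (perfectly correlated, equal variances).
    """
    a = var_x - cov_xy
    b = var_y - cov_xy
    return b / (a + b)


def combined_variance(w, var_x, var_y, cov_xy):
    """Variance of w*x + (1 - w)*y for correlated x and y."""
    return w**2 * var_x + (1 - w)**2 * var_y + 2 * w * (1 - w) * cov_xy


# Hypothetical variances: independent parts p and q, common part c.
var_p, var_q, var_c = 1.0, 3.0, 5.0
var_x, var_y, cov_xy = var_p + var_c, var_q + var_c, var_c

w = optimal_weight(var_x, var_y, cov_xy)
naive_w = var_y / (var_x + var_y)  # inverse-variance weight that ignores covariance

print("optimal weight:", w)
print("variance with optimal weight:", combined_variance(w, var_x, var_y, cov_xy))
print("variance with naive weight:", combined_variance(naive_w, var_x, var_y, cov_xy))
```

With these assumed values the optimal weight equals var_q / (var_p + var_q), independent of var_c, and the naive weight yields a larger output variance.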

One can see from Equation (14) that if one of the two constituent
signals presents only noise that is common to both
quantities x and y, e.g.:

$$\begin{aligned}
x &= p + c \\
y &= c
\end{aligned} \tag{15}$$

then the weight given to the constituent with the added noise 
(x in the example) will be zero. This fact can be used to predict 
when the CNS might reconstruct a representation of the task in 
a reference frame different from that of either the target local- 
ization or the feedback about the motor response. If the task 
allows for a direct comparison of target and effector informa- 
tion, e.g., when moving the hand to a remembered posture, the 
reconstructed comparison will contain all the variability of the 
kinesthetic inputs plus the noise added by the coordinate trans- 
formations while the direct comparison will contain no noise that 
is not also included in the reconstructed comparison: 

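$$\begin{aligned}
\sigma_{\Delta V}^2 &= \sigma_{T,K}^2 + \sigma_{H,K}^2 + \sigma_{T,K \mapsto V}^2 + \sigma_{H,K \mapsto V}^2 \\
\sigma_{\Delta K}^2 &= \sigma_{T,K}^2 + \sigma_{H,K}^2
\end{aligned} \tag{16}$$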

Applying Equation (14) means that the comparison of the reconstructed
signals, $\Delta V$, will be given no weight compared to the
direct comparison $\Delta K$. In other words, there is no advantage
to transforming the task into an alternate reference frame (e.g.,
in visual space) in this situation. On the other hand, if the target
and hand are sensed in two different reference frames, such
that at least one sensory transformation is required, then reconstruction
into a third reference frame might be beneficial. For
example, if one is asked to reproduce with the right hand the
remembered orientation of the left, a transformation will have
to be applied to compare the hand orientation between the two
limbs (see Figure 3), leading to the equations:

$$\begin{aligned}
\sigma_{\Delta V}^2 &= \sigma_{T,K_L}^2 + \sigma_{H,K_R}^2 + \sigma_{T,K_L \mapsto V}^2 + \sigma_{H,K_R \mapsto V}^2 \\
\sigma_{\Delta K_L}^2 &= \sigma_{T,K_L}^2 + \sigma_{H,K_R}^2 + \sigma_{H,K_R \mapsto K_L}^2 \\
\sigma_{\Delta K_R}^2 &= \sigma_{T,K_L}^2 + \sigma_{H,K_R}^2 + \sigma_{T,K_L \mapsto K_R}^2
\end{aligned} \tag{17}$$

where $K_L$ and $K_R$ represent the kinesthetic information about
the left and right hand, respectively. In this situation, each representation
of the task, including the representation that includes no
direct inputs ($\Delta V$), includes at least one source of noise that is
independent from each of the others. Thus, one might expect
to find that the task is carried out simultaneously in the intrinsic
reference frame of each arm, and also in visual space. Indeed,
when we compared precisely these two situations (matching the
posture of the right hand to the remembered posture of the left
versus matching the posture of the right hand to the remembered
posture of the right hand) we observed exactly this behavior.
The unilateral task showed no effect of deviations of the visual
field, while the bilateral task did. This same reasoning can also be
applied to a number of examples from the literature to explain
why subjects appeared to reconstruct a visual representation of a
task that could conceivably be carried out entirely in kinesthetic
space (Pouget et al., 2002b; Sober and Sabes, 2005; Sarlegna and
Sainburg, 2007; McGuire and Sabes, 2009; Jones and Henriques,
2010). Explicitly including the co-variation of reconstructed variables
therefore increases the predictive value of the model structure
depicted in Figure 1B.
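
The contrast between the unilateral and bilateral tasks can be illustrated with a short numerical sketch. It rests on assumptions that go beyond the text: every transformation is assumed to add the same hypothetical variance `t`, the noise added by different transformations is assumed to be mutually independent, and the helper `mle_weights` is introduced here for illustration only. Under these assumptions each comparison consists of a shared part (the sensory noise of target and hand) plus its own independent part, and the optimal weights are proportional to the inverse of the independent variances.

```python
def mle_weights(independent_vars):
    """Optimal weights for comparisons that share one common noise source.

    Each comparison is modelled as (common noise) + (its own independent
    noise). The common part cannot be averaged out, so the weights depend
    only on the independent variances. A comparison whose independent
    variance is zero takes all of the weight.
    """
    zeros = [v == 0 for v in independent_vars]
    if any(zeros):
        n_zero = sum(zeros)
        return [1.0 / n_zero if z else 0.0 for z in zeros]
    inverse = [1.0 / v for v in independent_vars]
    total = sum(inverse)
    return [i / total for i in inverse]


t = 2.0  # hypothetical variance added by any single transformation

# Unilateral task (Equation 16): the visual comparison requires two
# transformations, the direct kinesthetic comparison requires none.
intra = mle_weights([2 * t, 0.0])   # [visual, kinesthetic]

# Bilateral task (Equation 17): every comparison requires at least one
# transformation.
inter = mle_weights([2 * t, t, t])  # [visual, left arm, right arm]

print("unilateral weights:", intra)
print("bilateral weights:", inter)
```

In this sketch the visual comparison receives zero weight in the unilateral task and a non-zero weight in the bilateral task, which is the qualitative pattern described above.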

4. THE TIMING OF SENSORY RECONSTRUCTIONS 

If one accepts the idea that the CNS transforms sensory infor- 
mation amongst multiple reference frames, one might also ask 
the question, when do such transformations occur? A number of 
studies have considered the performance of cross-modal trans- 
formations for the computation of a movement vector during 
planning (Sober and Sabes, 2003, 2005; Sarlegna and Sainburg, 
2007; McGuire and Sabes, 2009; Burns and Blohm, 2010), but this 
is not the only time when such transformations may be needed. 
Sensory information about the target and limb continues to arrive 
throughout the movement, and the same issues about reference 
frames and sensor fusion arise when considering on-line correc- 
tions that are made based on this information. This question is of 
particular interest when one considers movements to memorized
targets. In a V-K task, for instance, which is a task that requires
at least one cross-modal sensory transformation, what happens 
if the target disappears before the reaching movement is started? 
How is the information about the target stored? Is it encoded in 
memory in visual space, to be transformed into kinesthetic space 
for comparison with proprioceptive information from the arm? 
Or is it immediately transformed into kinesthetic space and stored 
during the memory delay for later use? 

The results of one of our recent experiments (Tagliabue
et al., 2013) can be used to address this question. In that study
we analyzed the V-K task alluded to above and illustrated in
Figure 4. We asked subjects to perform this task in two different
conditions, which differed only in terms of the timing of head
movements. In one condition (U-T) subjects memorized the
target with the head upright and produced the motor response
with the head tilted. In the other condition (T-U) they memorized
the target with the head tilted and moved the hand with the
head upright. The rationale for performing this experiment with
the head tilted at different times is based on the notion that transformations
between visual and kinesthetic space are disrupted
(noisier) when the head is not aligned with gravity (Burns and
Blohm, 2010; Tagliabue and McIntyre, 2011). This assumption
is supported by a study of orientation matching between
visual and haptic stimuli (McIntyre and Lipshits, 2008). Whereas
tilting the subject's entire body had no effect on visual-visual
and haptic-haptic comparisons, responses were more variable
in the case of a visual-haptic comparison when the body was
tilted versus when it was upright. The fact that the inter-modal
comparison became more variable, but not the intra-modal ones,
indicates that it is the transformation between sensory modalities,
and not the actual sensory inputs, that is noisier when tilted
with respect to gravity. In light of this fact, the relative weight
given to visual information ($\lambda_V$) in our more recent experiment
and the overall variance ($\sigma_\Delta^2$) will depend on whether each transformation
is performed with the head upright or with the head
tilted.

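One can therefore differentiate between the different hypotheses
(Figure 4) as follows. For a V-K task we have:

$$\begin{aligned}
\sigma_{\Delta V}^2 &= \sigma_{T,V}^2 + \sigma_{H,K}^2 + \sigma_{H,K \mapsto V}^2 \\
\sigma_{\Delta K}^2 &= \sigma_{T,V}^2 + \sigma_{T,V \mapsto K}^2 + \sigma_{H,K}^2
\end{aligned} \tag{18}$$

Taking into account the co-variation between a transformed signal
and its source, as described in section 3, one can compute the
weight given to the visual comparison:

$$\lambda_V = \frac{\sigma_{T,V \mapsto K}^2}{\sigma_{T,V \mapsto K}^2 + \sigma_{H,K \mapsto V}^2} \tag{19}$$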



and given the formula for the variance of a weighted sum of two 
variables that are not independent: 



[Figure 4. Panel A: Upright - Tilted (U-T). Panel B: Tilted - Upright (T-U). Each panel shows head tilt and response alongside the information flow model.]

FIGURE 4 | Experimental manipulation of transformation noise
(modified from Tagliabue et al., 2013). Two different experimental
conditions are illustrated in which the subjects were asked to memorize the
orientation ($\theta$) of a visual target (red bar) and to reproduce it, after a delay,
with their unseen hand. (A) In one condition (U-T) subjects memorized the
target with the head upright and responded with the head tilted. (B) In the
other condition (T-U), the target was memorized with the head tilted and
the hand oriented with the head upright. On the right side of the figure are
depicted the predictions of the Concurrent Model for each of the two
experimental conditions. As in Figures 1, 2, red and blue arrows represent
visual and kinesthetic signals, respectively, and green arrows represent
cross-modal transformations. Gray symbols represent sensory inputs that
are absent. Because having the head tilted (yellow areas) causes
cross-modal transformations to be significantly noisier, comparisons
requiring such transformations are given less weight (faded green arrows)
and comparisons for which sensory reconstructions are performed with the
head upright are privileged.



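$$\sigma_{ax+by}^2 = a^2 \sigma_x^2 + b^2 \sigma_y^2 + 2ab\,\mathrm{cov}_{x,y} \tag{20}$$

the overall variance of the optimal estimate will be:

$$\sigma_\Delta^2 = \lambda_V^2 \sigma_{\Delta V}^2 + (1 - \lambda_V)^2 \sigma_{\Delta K}^2 + 2 \lambda_V (1 - \lambda_V)\,\mathrm{cov}_{\Delta V, \Delta K} \tag{21}$$

$$= \lambda_V^2 \left(\sigma_{T,V}^2 + \sigma_{H,K}^2 + \sigma_{H,K \mapsto V}^2\right) + (1 - \lambda_V)^2 \left(\sigma_{T,V}^2 + \sigma_{T,V \mapsto K}^2 + \sigma_{H,K}^2\right) + 2 \lambda_V (1 - \lambda_V) \left(\sigma_{T,V}^2 + \sigma_{H,K}^2\right) \tag{22}$$

$$= \sigma_{T,V}^2 + \sigma_{H,K}^2 + \frac{\sigma_{T,V \mapsto K}^2\,\sigma_{H,K \mapsto V}^2}{\sigma_{T,V \mapsto K}^2 + \sigma_{H,K \mapsto V}^2} \tag{23}$$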
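
The step from Equation (21) to Equation (23) can be verified numerically. The sketch below is an illustration only: the function names and the variance values are assumptions, and the covariance between the two comparisons is taken to be the sensory noise they share (target vision plus hand kinesthesia), as described in section 3.

```python
def visual_weight(var_tv_to_k, var_hk_to_v):
    """Equation (19): weight of the visual comparison in a V-K task."""
    return var_tv_to_k / (var_tv_to_k + var_hk_to_v)


def overall_variance(var_tv, var_hk, var_tv_to_k, var_hk_to_v):
    """Equation (21), evaluated term by term."""
    common = var_tv + var_hk        # noise shared by both comparisons
    var_dv = common + var_hk_to_v   # visual comparison, Equation (18)
    var_dk = common + var_tv_to_k   # kinesthetic comparison, Equation (18)
    lam = visual_weight(var_tv_to_k, var_hk_to_v)
    return (lam**2 * var_dv + (1 - lam)**2 * var_dk
            + 2 * lam * (1 - lam) * common)


def closed_form(var_tv, var_hk, var_tv_to_k, var_hk_to_v):
    """Equation (23)."""
    return (var_tv + var_hk
            + var_tv_to_k * var_hk_to_v / (var_tv_to_k + var_hk_to_v))


args = (1.0, 2.0, 3.0, 6.0)  # hypothetical variances, arbitrary units
print(overall_variance(*args), closed_form(*args))
```

Both expressions give the same value for any positive transformation variances, since the shared noise passes through the weighted sum unchanged.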



Now assume that the noise added when transforming from
visual to kinesthetic or from kinesthetic to visual is the
same, for a given orientation of the head, and that head tilt
has the same additive effect on all transformations, i.e., we
define:

$$\sigma_{T,V \mapsto K}^2 = \sigma_{H,K \mapsto V}^2 = \sigma_{\mapsto}^2 \tag{24}$$

when the transformation is performed with the head upright,
and:

$$\sigma_{T,V \mapsto K}^2 = \sigma_{H,K \mapsto V}^2 = \sigma_{\mapsto}^2 + \sigma_{\angle}^2 \tag{25}$$

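when the transformation is performed with the head tilted to
the side, where $\sigma_{\mapsto}^2$ is the transformation noise with the head
upright and $\sigma_{\angle}^2$ is the noise added by head tilt. Combining
Equations (18-25), one can see that tilting
the head will have no effect on $\lambda_V$ if both transformations are
performed with the head upright or both are performed with the
head tilted:

$$\begin{aligned}
\lambda_V|_{\text{up,up}} &= \frac{\sigma_{\mapsto}^2}{\sigma_{\mapsto}^2 + \sigma_{\mapsto}^2} = \frac{1}{2} \\
\lambda_V|_{\text{tilt,tilt}} &= \frac{\sigma_{\mapsto}^2 + \sigma_{\angle}^2}{\left(\sigma_{\mapsto}^2 + \sigma_{\angle}^2\right) + \left(\sigma_{\mapsto}^2 + \sigma_{\angle}^2\right)} = \frac{1}{2} \\
\lambda_V|_{\text{up,up}} &= \lambda_V|_{\text{tilt,tilt}}
\end{aligned} \tag{26}$$

Performing both transformations with the head upright or both
with the head tilted will, however, have an effect on the overall
variability:

$$\begin{aligned}
\sigma_\Delta^2|_{\text{up,up}} &= \sigma_{T,V}^2 + \sigma_{H,K}^2 + \frac{\sigma_{\mapsto}^2}{2} \\
\sigma_\Delta^2|_{\text{tilt,tilt}} &= \sigma_{T,V}^2 + \sigma_{H,K}^2 + \frac{\sigma_{\mapsto}^2 + \sigma_{\angle}^2}{2} \\
\sigma_\Delta^2|_{\text{tilt,tilt}} &= \sigma_\Delta^2|_{\text{up,up}} + \frac{\sigma_{\angle}^2}{2} \\
\sigma_\Delta^2|_{\text{tilt,tilt}} &> \sigma_\Delta^2|_{\text{up,up}}
\end{aligned} \tag{27}$$

On the other hand, if one of the transformations is performed
with the head upright, and the other with the head tilted, the
opposite pattern should be observed. The weight given to visual
information will depend on whether the transformation $T,V \mapsto K$
is performed with the head upright and the transformation
$H,K \mapsto V$ is performed with the head tilted (up,tilt), or vice versa
(tilt,up):

$$\begin{aligned}
\lambda_V|_{\text{up,tilt}} &= \frac{\sigma_{\mapsto}^2}{\sigma_{\mapsto}^2 + \sigma_{\mapsto}^2 + \sigma_{\angle}^2} \to 0 \ \text{as} \ \sigma_{\angle}^2 \to \infty \\
\lambda_V|_{\text{tilt,up}} &= \frac{\sigma_{\mapsto}^2 + \sigma_{\angle}^2}{\sigma_{\mapsto}^2 + \sigma_{\angle}^2 + \sigma_{\mapsto}^2} \to 1 \ \text{as} \ \sigma_{\angle}^2 \to \infty \\
\lambda_V|_{\text{up,tilt}} &< \lambda_V|_{\text{tilt,up}}
\end{aligned} \tag{28}$$

while one would expect to see similar levels of overall variability
between the two conditions, because in both cases one
transformation is performed with the head tilted and one with
the head upright:

$$\begin{aligned}
\sigma_\Delta^2|_{\text{up,tilt}} &= \sigma_{T,V}^2 + \sigma_{H,K}^2 + \frac{\sigma_{\mapsto}^2 \left(\sigma_{\mapsto}^2 + \sigma_{\angle}^2\right)}{\sigma_{\mapsto}^2 + \sigma_{\mapsto}^2 + \sigma_{\angle}^2} \\
\sigma_\Delta^2|_{\text{tilt,up}} &= \sigma_{T,V}^2 + \sigma_{H,K}^2 + \frac{\left(\sigma_{\mapsto}^2 + \sigma_{\angle}^2\right) \sigma_{\mapsto}^2}{\sigma_{\mapsto}^2 + \sigma_{\angle}^2 + \sigma_{\mapsto}^2} \\
\sigma_\Delta^2|_{\text{up,tilt}} &= \sigma_\Delta^2|_{\text{tilt,up}}
\end{aligned} \tag{29}$$

Note that the results remain valid even if $\sigma_{T,V \mapsto K}^2 \neq \sigma_{H,K \mapsto V}^2$ for
plausible values of $\sigma_{T,V \mapsto K}^2$, $\sigma_{H,K \mapsto V}^2$, $\sigma_{T,V}^2$, $\sigma_{H,K}^2$ and $\sigma_{\angle}^2$.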
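
The four cases covered by Equations (26-29) can be tabulated with a few lines of code. This is a minimal sketch under the simplifying assumptions of Equations (24 and 25); the constant names and numerical values are hypothetical and chosen only to make the pattern visible.

```python
def weight_and_variance(var_tv, var_hk, var_target_tf, var_hand_tf):
    """Visual weight (Equation 19) and overall variance (Equation 23)."""
    lam_v = var_target_tf / (var_target_tf + var_hand_tf)
    total = (var_tv + var_hk
             + var_target_tf * var_hand_tf / (var_target_tf + var_hand_tf))
    return lam_v, total


# Hypothetical values, arbitrary units.
VAR_TV, VAR_HK = 1.0, 1.0    # sensory noise: visual target, kinesthetic hand
VAR_TF, VAR_TILT = 2.0, 4.0  # transformation noise upright, extra noise when tilted


def transform_noise(head):
    """Noise added by one transformation for a given head orientation."""
    return VAR_TF + (VAR_TILT if head == "tilt" else 0.0)


results = {}
for target_head in ("up", "tilt"):      # head orientation for T,V -> K
    for hand_head in ("up", "tilt"):    # head orientation for H,K -> V
        results[(target_head, hand_head)] = weight_and_variance(
            VAR_TV, VAR_HK,
            transform_noise(target_head), transform_noise(hand_head))

for case, (lam_v, total) in results.items():
    print(case, round(lam_v, 3), round(total, 3))
```

With these assumed values the visual weight is identical for (up,up) and (tilt,tilt) while the variance differs, and the reverse holds for (up,tilt) versus (tilt,up).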

Using these mathematical considerations and the results of our
experiment, one can distinguish between the three hypotheses
about the timing of sensory reconstructions shown in Figure 5.
If the movement vector is computed while the target is still visible
(Figure 5A), then both transformations ($T,V \mapsto K$ and $H,K \mapsto V$)
will be performed with the head upright in the U-T condition
and both will be performed with the head tilted in the
T-U condition. According to Equations (26 and 27), the relative
weight given to visual information should not change between
the U-T and T-U conditions, while the overall variance should
be greater for T-U than for U-T. Neither of these predictions is
consistent with our empirical results, in which we observed a significantly
greater weight given to visual information in the T-U
condition, compared to U-T, and similar levels of overall variability
for both (Tagliabue et al., 2013). Note that this hypothesis
can also be rejected by the strong effect of response modality
that we observed in our previous study (Tagliabue and McIntyre,
2011). In all conditions tested in that study (K-K, K-VK, K-V, V-K,
V-VK, and V-V) the subject's hand was outside the field of view
during the time when the target was being presented. Therefore in
all conditions the information available about the hand's orientation
during target observation was de facto the same. If Figure 5A
were correct, we would not have observed the strong effect of
response modality on the weight given to visual versus kinesthetic
information.

Figure 5B depicts an alternative hypothesis by which the CNS
performs the requisite coordinate transformations starting at
movement onset, relying on visual memory of the target after
it disappears. In this case both transformations ($T,V \mapsto K$ and
$H,K \mapsto V$) would be performed with the head upright in the T-U
condition and with the head tilted in the U-T condition. Applying
once again Equations (26 and 27), one would expect to see similar
weight given to visual information in both conditions and a
significant difference in the overall variability, although according
to this hypothesis, the higher variability would occur for U-T. As
before, the empirical observations (Tagliabue et al., 2013) do not
match the predictions of Figure 5B.

Our experimental findings can, however, be reconciled with a
hypothesis by which cross-modal reconstructions of target and
hand occur continuously, but only as long as the sensory input to be
transformed is present (Figure 5C). When the target disappears, as
in our experiments, further reconstruction of its kinesthetic orientation
from visual information is halted, and the remembered

FIGURE 5 | Timing of cross-modal reconstructions. Hypotheses
concerning the time course of sensorimotor reconstructions are represented
for the task depicted in Figure 4. The visibility of the target (purple bar) and
the tilt of the head (yellow bar) are shown as time progresses from left to
right. The hand moves only after the rotation of the head is terminated.
Horizontal lines represent internal representations of the target ($\theta_{T,V}$ and $\theta_{T,K}$)
and of the hand ($\theta_{H,V}$ and $\theta_{H,K}$). Gray symbols indicate sensory inputs that are
absent, while green arrows indicate cross-modal reconstructions that may be
performed. Vertical arrows and nodes indicate when the comparisons of
target and hand are carried out, according to three hypotheses: (A)
Cross-modal reconstructions and concurrent target-hand comparisons
($\Delta\theta_V$, $\Delta\theta_K$) are performed while the target is visible and the resulting
movement vector ($\Delta\theta$) is maintained and updated through the end of the
movement. (B) Cross-modal reconstructions are performed during
movement execution, relying on sensory inputs about the target stored in
memory. (C) Cross-modal reconstructions are performed continuously as
long as the sensory input is present; direct and reconstructed target
representations are maintained in memory in parallel through the end of the
movement. Faded nodes indicate target-hand comparisons that are noisier
because they rely on cross-modal reconstructions that were performed with
the head tilted. Hypotheses (A,B) predict similar weighting of visual and
kinesthetic information, and thus partial deviations of the response in both
the U-T and T-U conditions, while hypothesis (C) predicts a significantly larger
weighting of the visual comparison in the T-U than in the U-T condition.



orientation is maintained in both spaces. Transformation of the 
continuously available hand kinesthesia into the visual domain 
proceeds, however, through the end of the movement. Here we 
fall into the situation in which the sensory transformations poten- 
tially used to control the movement do not all occur with the head 
at the same orientation. In the U-T condition, the last transfor- 
mation of the target into kinesthetic space will occur with the 
head upright, while the latest transformations of the hand into 
visual space will occur throughout the movement, i.e., with the 
head tilted. Conversely, in the T-U condition, the last transforma- 
tion of the target will occur with the head tilted, and the latest 
transformations of the hand with the head upright. Applying 
Equations (28 and 29), one expects to see a greater reliance on 
visual information in T-U than in U-T, with similar levels of 
overall variability between the two conditions, precisely as we 
observed (Tagliabue et al., 2013).
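
The way the three timing hypotheses map onto distinct predictions can be made explicit in a small table-driven sketch. Everything in it is illustrative: the dictionaries `CONDITIONS` and `HYPOTHESES`, the function `predict` and the noise values are assumptions introduced here, and the returned variance includes only the transformation-related term of Equation (23), since the sensory noise is the same in all cases.

```python
# Head orientation during target memorization and during the response.
CONDITIONS = {"U-T": ("up", "tilt"), "T-U": ("tilt", "up")}

# Phase in which the (target, hand) transformations are assumed to occur.
HYPOTHESES = {
    "A": ("memorize", "memorize"),  # all computed while the target is visible
    "B": ("respond", "respond"),    # all computed from movement onset
    "C": ("memorize", "respond"),   # target while visible, hand during movement
}


def predict(hypothesis, var_tf=2.0, var_tilt=4.0):
    """Return {condition: (visual weight, transformation-related variance)}."""
    out = {}
    target_phase, hand_phase = HYPOTHESES[hypothesis]
    for cond, (head_mem, head_resp) in CONDITIONS.items():
        heads = {"memorize": head_mem, "respond": head_resp}
        t_noise = var_tf + (var_tilt if heads[target_phase] == "tilt" else 0.0)
        h_noise = var_tf + (var_tilt if heads[hand_phase] == "tilt" else 0.0)
        lam_v = t_noise / (t_noise + h_noise)            # Equation (19)
        extra = t_noise * h_noise / (t_noise + h_noise)  # last term of Equation (23)
        out[cond] = (lam_v, extra)
    return out


for name in HYPOTHESES:
    print(name, predict(name))
```

Under these assumptions only hypothesis C yields a larger visual weight in T-U than in U-T together with equal variability in the two conditions.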

To summarize, we have shown that the reconstruction of
sensory signals in alternate reference frames appears to occur
only while the primary sensory input is available. An important
corollary to this conclusion is that the CNS will also store spatial
information concurrently in multiple reference frames, a
prediction that can, in theory, be tested experimentally.

5. GENERALIZED CONVERGENT AND CONCURRENT 
MODELS 

In the preceding sections we have discussed how the CNS 
might benefit from performing multiple, concurrent compar- 
isons when, for instance, bringing the hand into alignment with 
a target. This discussion has highlighted a number of pertinent 
issues, including the evidence for single versus multiple comparisons,
the importance of considering co-variation of signals
when computing weights based on maximum likelihood and the 
timing of inter-modal transformations. The preceding sections 
leave open a number of questions, however, about when the var- 
ious input signals are combined and about how to extend these 
concepts to situations where more than two sensory modalities 
may be involved. In this section we will formalize the distinction 
between convergent versus concurrent structures. In the section

FIGURE 6 | Generalized models. Three possible formulations of the
sensorimotor integration model, all based on principles of maximum
likelihood. (A) Fully-convergent model: first, the position of the target and
hand are optimally estimated independently; then, these optimal
estimations are compared in a unique reference frame to compute a single
movement vector $\Delta x$. Green arrows represent the sensory transformations
necessary to encode the signals in a common reference frame before they
can be combined. (B) Hybrid-convergent/concurrent model: available
sources of information about the target are combined to build optimal
estimations of its position in different reference frames. Ditto for the
available hand information. Target-hand comparisons are then performed in
each of these reference frames ($\Delta x_V$ and $\Delta x_K$) and the results of these
comparisons are optimally combined to produce the net movement vector
($\Delta x$). (C) Fully-concurrent model: available sources of information about
target and hand are used to build concurrent target-hand comparisons in
various reference frames. Information directly available in a given reference
frame can be compared both with information directly available in the same
reference frame and with information reconstructed from signals initially
encoded in a different reference frame. It follows that, for each reference
frame, all combinations between direct (d) and reconstructed (r) signals
may be used to perform comparisons: $\Delta V_{d,d}$, $\Delta V_{d,r}$, etc.



that follows we will show how the various computational con- 
cepts can be broadened to include questions such as how the CNS 
makes use of intrinsic versus extrinsic reference frames. 

5.1. FULLY CONVERGENT MODEL 

Figure 6A shows the computational structure of the fully con- 
vergent model. A maximum likelihood estimate is made from 
all available inputs about the target's position and a similar pro- 
cess is applied to all available information about the position 
of the hand. As pointed out in section 2.1, the various sources 
of information must be transformed into a common reference 
frame in order for these optimal estimates to be computed and 
these transformations add noise. The calculations that describe 
the convergent model are therefore given by: 

$$\Delta x = \sum_{i=1}^{n} w_{T,i}\,\Psi_{i \mapsto r}\left(x_{T,i}\right) - \sum_{j=1}^{m} w_{H,j}\,\Psi_{j \mapsto r}\left(x_{H,j}\right) \tag{30}$$

where $x_{T,i}$ and $x_{H,j}$ are the sensory inputs about the target
position in reference frame i and the hand position in reference
frame j. Each input is associated with its own intrinsic variability
($\sigma_{T,i}^2$ or $\sigma_{H,j}^2$). The operator $\Psi_{a \mapsto r}$ represents the transformation
of a position value from some reference frame a into the common
reference frame r. Applying $\Psi_{a \mapsto r}$ to an input value expressed in
its intrinsic coordinate frame a creates a new value in the reference
frame r with noise variance equal to the sum of the variance of the input
(e.g., $\sigma_{T,i}^2$) and the variance added by the transformation ($\sigma_{a \mapsto r}^2$).
Note that the common reference frame r could be some abstract
reference frame that is independent from any given sensory frame,
or it could be one of the n reference frames intrinsic to the sensory
modalities used to sense the target position or one of the m reference
frames used to sense the hand position. In this latter case,
no transformation will be required for at least one sensory input,
and we define $\Psi_{r \mapsto r}(x) = x$ and $\sigma_{r \mapsto r}^2 = 0$.
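
The structure of the fully convergent model can be sketched in code for a scalar position. This is an illustration under strong simplifying assumptions that are not part of the model itself: inputs are treated as independent, each transformation is represented as an identity mapping that only adds variance, and the names `mle_combine` and `convergent_delta` as well as all numbers are hypothetical.

```python
def mle_combine(values, variances):
    """Inverse-variance weighted average of independent estimates."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return sum(x * i for x, i in zip(values, inverse)) / total


def convergent_delta(target_inputs, hand_inputs, transform_var):
    """Movement vector of the fully convergent model (scalar case).

    Each input is a tuple (frame, value, variance). transform_var[frame]
    is the variance added when bringing that frame into the common
    frame r (0 for r itself).
    """
    def estimate(inputs):
        values = [x for _, x, _ in inputs]
        variances = [var + transform_var[frame] for frame, _, var in inputs]
        return mle_combine(values, variances)

    return estimate(target_inputs) - estimate(hand_inputs)


# Hypothetical example with the visual frame chosen as the common frame r.
transform_var = {"V": 0.0, "K": 1.0}
target = [("V", 10.0, 1.0), ("K", 12.0, 1.0)]  # target seen and felt
hand = [("K", 4.0, 1.0)]                       # hand felt only
print(convergent_delta(target, hand, transform_var))
```

Because the kinesthetic target signal carries the added transformation variance, it receives half the weight of the visual signal in this example, even though the two inputs are equally reliable in their native frames.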

5.2. HYBRID CONVERGENT/CONCURRENT MODEL 

According to the model presented in Figure 6B, it is presumed 
that the CNS will use all available information to represent the 
task in each of the component reference frames, and will then 
concurrently compare the target to the hand within each reference 
frame, before combining the results of each comparison to drive 
the motor response. We base this formulation on the model pro- 
posed by McGuire and Sabes (2009) for the combination of visual 
and kinesthetic information. From their discussion: "movements
are always represented in multiple reference frames", and from the
Methods: "the model first builds internal representations of fingertip
and target locations in both retinotopic and body-centered reference
frames. These representations integrate all available sensory signals,
requiring the transformation of non-native signals." Extending
these concepts to more than two sensory modalities and reference
frames, the equation describing this formulation is:

$$\Delta x = \sum_{i=1}^{N} \lambda_i \left( \sum_{j=1}^{n} w_{T,j \mapsto i}\,\Psi_{j \mapsto i}\left(x_{T,j}\right) - \sum_{j=1}^{m} w_{H,j \mapsto i}\,\Psi_{j \mapsto i}\left(x_{H,j}\right) \right) \tag{31}$$



where N is the total number of reference frames for which the
comparison between target and hand will be made, $n \leq N$ is
the number of reference frames in which target information is
directly available and $m \leq N$ is the number of reference frames
in which hand feedback is available. Implicit in this formulation
is the idea that the CNS will always reconstruct sensory signals
across modalities, even when sensory information is directly
available within a given modality. One can see that this formulation
allows for two sets of weights, those that determine the
weight given to direct and reconstructed inputs within each reference
frame [$w_{T,j \mapsto i}$ and $w_{H,j \mapsto i}$, comparable to the weights w
described in the convergent model of Equation (1)] and those
used to combine the results of the differences computed in each
reference frame [comparable to the weights $\lambda$ in the concurrent
model of Equation (2)]. So, for instance, if both visual and
kinesthetic information is available about the target, both the 
direct visual input and a transformed version of the kinesthetic 
information will be used to construct a representation of the tar- 
get in visual space. Similarly, both the direct sensory input and 
the reconstructed visual input will be used to construct a rep- 
resentation of the target in kinesthetic space. The weight given 
to each source of information, however, will take into account 
the noise added by the cross-modal transformations. Thus, the 
representation of the movement in visual space will give more
weight to the direct visual input than to the visual representation
that is reconstructed from kinesthetic signals, etc. According to 
this model, the CNS will read out the desired movement vector 
by combining the differences computed concurrently in each ref- 
erence frame, also according to the expected variance of each of 
the differences. 

5.3. A FULLY-CONCURRENT MODEL 

Here we propose a third formulation, shown in Figure 6C, based 
on the concept that individual comparisons form the building
blocks for multisensory control of hand-eye coordination.
According to this proposal, each available sensory input may be 
transformed into any and all other potential reference frames, 
as in the hybrid model described above. The two models differ, 
however, in terms of how the various reconstructions are handled 
within each reference frame. According to the fully concurrent 
model, the direct and reconstructed signals are not combined 
into a single representation of the target and of the hand within 
each reference frame. Rather, the CNS would compute individ- 
ually the differences between all possible permutations of target 
and hand representations, both direct and reconstructed, within 
each reference frame, on a pair-by-pair basis. Only then would 
the results of all the individual differences be combined through 
a weighted average according to MLE in order to compute the 
movement vector. The computations that describe such a fully 
distributed, concurrent model, based on individual differences 
can be described by: 

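$$\Delta x = \sum_{i=1}^{N} \sum_{j=1}^{n} \sum_{k=1}^{n} \lambda_{i,j,k} \left( \Psi_{j \mapsto i}\left(x_{T,j}\right) - \Psi_{k \mapsto i}\left(x_{H,k}\right) \right) \tag{32}$$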

A simple mathematical convenience serves to adapt Equation (32)
to situations where direct sensory inputs about the target or the
hand are missing in one or more of the n sensory modalities.
According to MLE, a given signal is weighted according to the
inverse of its expected variance. If the quantity $1/\sigma^2$ is a measure
of the confidence that one has in a given signal (i.e., the greater
the variability, the lower the confidence), one can therefore assign
to a missing sensory input an infinite variance, in the sense that
the confidence in a missing signal will be $1/\sigma^2 = 1/\infty = 0$. By
doing so, the weight given to a missing input, or to a transformed
version of a missing input, will automatically fall to zero in the
calculations derived from MLE.
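
The infinite-variance convention translates directly into code. The sketch below is illustrative only: it assumes independent signals for simplicity, the function name `mle_weights` and the variance values are hypothetical, and at least one signal must have a finite variance.

```python
import math


def mle_weights(variances):
    """Normalized inverse-variance weights; infinite variance gives weight 0."""
    confidences = [1.0 / v for v in variances]  # 1.0 / math.inf evaluates to 0.0
    total = sum(confidences)
    return [c / total for c in confidences]


# Hypothetical case: three candidate signals, the second one is missing.
weights = mle_weights([2.0, math.inf, 4.0])
print(weights)
```

No special case is needed for the missing signal: its confidence is zero, so it drops out of the weighted sum while the remaining weights still sum to one.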

Note that Model 6C is "fully connected", allowing for the
possibility that, for instance, the CNS will reconstruct and compare
kinesthetic signals in a visual reference frame even though
both target and hand may be visible. This means that there
may be multiple comparisons of the target and hand within
any one reference frame due to the reconstruction from more
than one other reference frame. Nevertheless, given the noise
inherent to the reconstruction, the application of MLE will favor
the comparison of the directly sensed signals within
each reference frame, when such direct information is available.
Indeed, some components may drop out of the equation
because MLE gives them a weight of zero, as we will see in the
following.



6. EXTRINSIC REFERENCE FRAMES 

In the examples given above we have focused mainly on intrinsic 
reference frames native to the sensory modalities used to local- 
ize the target and the hand. This is due in part to the fact that 
the most widely documented studies of sensor fusion for eye- 
hand coordination, including those cited above, have considered 
two main reference frames: retinal for visual information and 
body centered for kinesthetic (a.k.a. proprioceptive) informa- 
tion. Depending on the task, however, other non-native reference 
frames are almost certainly of interest. For instance, ample 
evidence exists for the encoding of limb movements (Soechting and 
Ross, 1984; Darling and Gilchrist, 1991; Borghese et al., 1996; 
Luyat et al., 2001; Darling et al., 2008) or visual stimuli (Asch and 
Witkin, 1948b; Luyat and Gentaz, 2002) in a gravitational reference 
frame, as well as the encoding of information with respect to 
visual landmarks (Asch and Witkin, 1948a). In the following we 
examine the question of whether or not to make use of extrinsic 
reference frames in the context of each of the three models shown 
in Figure 6. 

The convergent model of Figure 6A can accommodate the 
recoding of a sensorimotor task by realizing a change in the 
common reference frame r. Thus, the CNS may choose to com- 
bine sensory inputs in one possible reference frame or another, 
depending on the task conditions. Nothing in Equation (30), 
however, says anything about how r is chosen. Additional rules, 
not specified in Equation (30), would have to be found to resolve 
this outstanding question. As such, Model 6A is incomplete. 
Models 6B,C provide more elegant solutions to this question. An 
astute reader will have noticed the distinction between the 
lowercase n and m in Equations (31) and (32), representing the number 
of sensory inputs, and the uppercase N, indicating the number of 
reference frames in which the comparison of target and 
hand is performed. These numbers could all be the same, but the 
two formulations allow for the use of additional reference frames 
not directly linked to a sensory input as well. According to these 
equations, each sensory input may be reconstructed in additional, 
non-native reference frames. Candidates include other, derived 
egocentric references such as the head or the shoulder, or 
external references such as gravity or visual landmarks. 

From the perspective of minimizing variability, however, 
recoding of sensory information in a non-native reference frame 
would not necessarily be advantageous, because the transformation 
of the information from a native to a non-native reference frame 
introduces additional noise. For instance, the variability of a 
visual target encoded with respect to gravity will include the vari- 
ability of both retinal signals and of graviceptors. Moreover, all the 
variance of the target-hand comparison in the retinal reference 
frame will be included in the comparison encoded in the external 
reference frame. According to the analysis presented in section 
3 the weight given to the external representation would drop to 
zero. One might therefore surmise that the recoding of spatial 
information in non-native reference frames will be avoided, when 
possible, in deference to direct comparisons of sensory informa- 
tion within the intrinsic reference frame of the different neural 
receptors. As we will show in the following examples, however, 
the native sensory representations may be affected by additional 
sources of noise, depending on the circumstances. The principle 
of maximum likelihood, coupled with the concurrent structures 
of Models 6B,C, can then predict which of the N reference frames, 
intrinsic or extrinsic, come into play in any given situation. 

FIGURE 7 | External reference frames. Example of how external sources 
of information, such as gravity and the visual scene, can be combined to 
build external encodings of initially retino-centric signals about the target, 
$x_{T,v}$, and the hand, $x_{H,v}$. Open circular and square nodes represent the 
recoding of information with respect to an external reference (circles) or the 
updating of egocentric information to account for movements of the body (squares). 
All other symbols for inputs and transformations are as defined in previous 
figures. (A) If no movement occurs after the memorization of the retinal 
information about the target, its direct comparison with the retinal signal 
about the hand is possible; encoding these signals with respect to 
the external gravitational and scene references would therefore not reduce 
movement variability. (B) If the head moves in space, or if the eye moves 
within its orbit, a direct comparison between retinal signals about the target 
and hand is not possible, because the retinal information about the target 
must be updated to take into account the sensor movement. In this case, 
encoding the initially retino-centric signals with respect to gravity and the 
visual scene becomes advantageous, because the egocentric and the 
external encodings become partially uncorrelated. 

6.1. EXTERNAL REFERENCE FRAMES 

Figure 7 shows an example of how the concurrent models may be 
applied to the question of whether or not to make use of an exter- 
nal reference frame for a given task. The model predicts that if 
the target and the hand can be sensed through the same modality 
and no movement of the sensor occurs between target memoriza- 
tion and response (Figure 7A), the brain should privilege a direct 
egocentric encoding of the movement. Since the transformation 
into the alternative reference frame would add noise, maximum 
likelihood will give the most weight to the direct comparison. This 
effect is amplified if one considers the co-variation between direct 
and reconstructed signals. Because a comparison performed in 
any other reconstructed reference frame would co-vary precisely 
with the inputs to the direct comparison, performing these addi- 
tional encodings would not reduce the variability of the move- 
ment at all. On the other hand, if a movement occurs after the 
target is stored in memory (Figure 7B), an egocentric memory 
of the target would need to be updated to account for the sen- 
sor displacement (Droulez and Berthoz, 1992; Duhamel et al., 
1992; Medendorp et al., 2008). In this situation, reconstructing 
additional, external encodings of the movement becomes advan- 
tageous, because the noise added by the updating of the intrinsic 
representation becomes comparable to the noise added when 
reconstructing in an external reference frame. This is especially 
true when the noise in the information used to update the ego- 
centric representation of the target and the noise in the signals 
used as external references are independent. 
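
The two situations of Figure 7 can be illustrated numerically. The sketch below uses the standard minimum-variance weighting for correlated estimates, $w = C^{-1}\mathbf{1} / (\mathbf{1}^T C^{-1}\mathbf{1})$, which reduces to inverse-variance weighting when the covariance matrix $C$ is diagonal. The function name `optimal_weights` and the variances `s`, `t` and `u` are assumptions introduced for this example; they are not values taken from the experiments discussed here.

```python
import numpy as np

def optimal_weights(cov):
    """Minimum-variance weights (summing to 1) for correlated estimates.

    Computes w = C^{-1} 1 / (1^T C^{-1} 1), the covariance-aware
    generalization of inverse-variance weighting.
    """
    cov = np.asarray(cov, dtype=float)
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)
    return raw / raw.sum()

# Hypothetical variances (arbitrary units), for illustration only.
s = 1.0   # sensory noise shared by both encodings (retinal signals)
t = 0.5   # noise added by reconstructing in the external frame
u = 0.5   # noise added by updating the egocentric memory after a movement

# Figure 7A: the external comparison equals the direct one plus extra noise,
# so it co-varies fully with the direct comparison.
cov_7a = [[s, s],
          [s, s + t]]

# Figure 7B: the egocentric comparison carries updating noise (u) that is
# independent of the noise in the external references (t).
cov_7b = [[s + u, s],
          [s, s + t]]

w_7a = optimal_weights(cov_7a)   # order: [egocentric, external]
w_7b = optimal_weights(cov_7b)
print(w_7a, w_7b)
```

Under these assumptions the external encoding receives zero weight in the situation of Figure 7A, whereas in the situation of Figure 7B, with equal updating and reconstruction noise, the two encodings are weighted equally.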

The parallel structures of Models 6B,C are interesting because 
they provide a theoretical basis for using a combination of 
intrinsic and extrinsic reference frames, which appears to 
correspond well to behavioral (Burgess et al., 2004; Vidal et al., 2004; 
Burgess, 2006; Byrne et al., 2010) and physiological (Dean and 
Platt, 2006; Zaehle et al., 2007) evidence. Indeed, in a task of 
reaching with the outstretched hand for a visual or kinesthetic 
target, with visual or kinesthetic feedback about the response, or 
both, we were unable to reconcile empirical data with a 
computational model that relied on intrinsic reference frames alone 
(Tagliabue and McIntyre, 2012). We surmised that due to the 
movement of the head in our experiment, subjects encoded the 
task in external reference frames as well. Psychophysical studies 
have also shown that subjects tend to use egocentric 
representations if they remain stable after memorization, but they 
combine egocentric and external representations if their body moves 
(Burgess et al., 2004; Burgess, 2006). Similarly, during reaching to 
visual targets, external visual landmarks appear to be neglected if 
visual feedback of the hand is reliable, whereas they are integrated to 
build an allocentric representation of the movement if 
visual feedback of the hand is absent or unpredictable (Obhi and Goodale, 
2005; Neely et al., 2008). 

6.2. MEMORY 

The need to store target information in memory for some time 
before the movement occurs can also motivate the transformation 
of sensory information into a non-native reference frame. In 
eye-hand tasks with imposed memory delays, the variability of 
responses tends to increase with the length of the delay (McIntyre 
et al., 1997, 1998). Thus, the simple act of storing spatial 
information in memory adds noise. According to the hypothesis related 
in section 4, the target location will be stored in memory 
simultaneously in more than one reference frame. Assuming that each 
representation of the remembered target position will degrade 
independently (i.e., each will accumulate noise that is 
stochastically independent from the other), it becomes more and more 
advantageous, from a maximum likelihood perspective, to make use of 
the non-native representations, despite the added cost of 
reconstructing those representations in the first place. This reasoning 
is supported by a study in which subjects were asked to point 
to targets located along a straight line in 3D space (Carrozzo 
et al., 2002). As the memory delay increased, patterns of 
variability of the pointing position were more and more constrained 
by the extrinsic reference provided by the direction of the line 
in 3D space. This can be interpreted as a shift in weighting 
between egocentric and allocentric reference frames, even when 
the body does not move. By simply substituting "memory 
processes" for "head/eye movement", Figure 7 can therefore be used 
to understand why the CNS may rely more on the encoding of 
a task in an external reference frame when memory processes are 
involved. 
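
The shift in weighting with memory delay can be made explicit with a small worked example. The symbols used here are assumptions introduced for illustration and do not appear in the numbered equations of this article: $\sigma_x^2$ is the variance of the sensory input shared by both representations, $\tau$ is the variance added by reconstructing the non-native representation, and $m$ is the variance accumulated in memory, taken to be of equal magnitude in, and independent between, the two representations.

```latex
\begin{align}
\hat{x}_{\mathrm{nat}} &= x + \varepsilon_x + \varepsilon_{m,1}, \qquad
\hat{x}_{\mathrm{ext}} = x + \varepsilon_x + \varepsilon_\tau + \varepsilon_{m,2} \\
\operatorname{Var}\!\left(w_{\mathrm{nat}}\,\hat{x}_{\mathrm{nat}}
      + w_{\mathrm{ext}}\,\hat{x}_{\mathrm{ext}}\right)
  &= \sigma_x^2 + w_{\mathrm{nat}}^2\, m + w_{\mathrm{ext}}^2\,(\tau + m),
  \qquad w_{\mathrm{nat}} + w_{\mathrm{ext}} = 1 \\
w_{\mathrm{ext}} &= \frac{m}{2m + \tau}
\end{align}
```

Under these assumptions the non-native representation receives no weight in the absence of memory noise ($m = 0$), and its weight grows toward 1/2 as $m$ becomes large compared with $\tau$, in line with the qualitative argument given above.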

7. DISCUSSION 

In this paper we have described three analytical models (see 
Figure 6) that share a number of defining features. One of these, 
the idea that the CNS can express spatial information in multiple 
reference frames while transforming information between them, 
is a common theme that is supported by numerous theoretical 
and experimental studies. To cite a few examples, Droulez and 
Cornilleau-Peres (1993) proposed a distributed model of "coher- 
ence constraint" by which spatial information may be encoded 
in reference frames intrinsic to each sensor and they described 
a computational structure by which information from one sensor 
can be reconstructed based on redundant information from other 
sensors when the primary source is not available. Bock (1986) 
identified a phenomenon of bias when pointing to targets that lie 
at a location peripheral to the center of gaze. This phenomenon 
has been used in a number of studies to argue that whether point- 
ing to visual, auditory or even proprioceptive targets, the CNS 
carries out the task in retinotopic coordinates (Enright, 1995; 
Henriques et al., 1998; Pouget et al., 2002b). These observations 
can be linked to neural properties through models that solve the 
problem of recoding information in different reference frames 
by using basis functions and attractor dynamics (Pouget et al., 
2002a) or restricted Boltzmann machines (Makin et al., 2013). 

The premise that the CNS combines sensory information 
based on relative variance has also found considerable experimen- 
tal support: van Beers et al. (1996) showed that the precision of 
pointing movements increased when the subject could use both 
visual and kinesthetic feedback signals, compared to when only 
one sensory feedback modality was available. They also showed 
that the relative weight given to the two sensory signals depended 
on their relative variability (van Beers et al., 1999). Ernst and 
Banks (2002) varied experimentally the noise in the sensory sig- 
nals available to subjects when they grasped a virtual object 
that provided both visual and haptic cues about size. Using ver- 
bal judgments, they showed how the overall perceptual response 
shifted toward the haptic information when the precision of the 
visual inputs was degraded. Smeets et al. (2006) assumed that 
the CNS maintains both a visual and a kinesthetic representation 
of targeted movements. When vision of the hand was allowed, 
this sensory modality dominated due to its higher precision. But 
when vision of the hand was occluded and subjects were asked 
to make consecutive movements, the authors observed a gradual 
shift toward a reliance on proprioceptive information, as 
indicated by a gradual drift in the direction of biases that are 
specifically associated with this modality. They attributed this shift to 
a re-weighting toward proprioceptive information as the visual 
representation of the occluded hand degrades over the course of 
sequential movements. 



These themes of transformations and maximum likelihood 
come together when one considers the noise added when con- 
verting sensory information from one reference frame to another. 
As alluded to in section 2.1, the added noise inherent to sen- 
sory information that is reconstructed from other sources will 
cause a shift toward the alternative, directly sensed information. 
This principle has given rise to other empirical manifestations: 
Sober and Sabes (2003, 2005) postulated that the CNS combines 
visual and proprioceptive information at two different stages in 
the planning of targeted hand movements. First, the movement 
vector is calculated in visual space as the difference between the 
position of the visual target and the initial position of the hand. 
Kinesthetic information about the hand's position is also used 
at this stage, but because it must be transformed into visual 
space, it is given much less weight, in accord with MLE. At a 
second stage, the visual movement vector is converted into a 
motor vector, based primarily on proprioceptive information, but 
also accommodating a weaker influence of visual information 
about the target, hand and limb configuration transformed into 
motor coordinates. Burns and Blohm (2010), using the same 
model structure as Sober and Sabes, observed a reduction of 
the weight given to proprioceptive information in the calcula- 
tion of the movement vector during planning when the head was 
tilted in a V-VK task. They attributed the shift to the fact that 
(a) the movement vector was calculated in visual space, requir- 
ing that the proprioceptive information about hand position be 
transformed in order to be useful and (b) tilting the head with 
respect to gravity increases the noise added by manual-to-visual 
transformations, thus further decreasing the weight given to the 
reconstructed signals. Tagliabue et al. (2013) examined the effects 
of head tilt on the weighting of sensory information. In a V-K task 
(Figure 4), if the head was tilted during target acquisition, but 
not the motor response, the CNS gave greater weight to the visual 
representation, presumably because transforming the visual tar- 
get into kinesthetic space with the head tilted would be much 
noisier than transforming kinesthetic information about the hand 
into visual space with the head upright. Conversely, if the head 
was held upright when the target was acquired, but the head 
was tilted during the motor response, then the task was carried 
out in kinesthetic space so as to avoid the kinesthetic-to-visual 
transformation that would have to occur while the head was tilted. 

Although the three computational models of Figure 6 share 
a number of features, as described above, they vary in terms 
of the level of convergence or parallelism in the processing of 
sensory information. Model 6A presents the highest level of con- 
vergence, combining all available inputs about the target and all 
available inputs about the hand before calculating a movement 
vector based on the two optimal estimates. Model 6A provides no 
clue, however, as to what is the common reference frame for any 
given task, nor how the common reference frame might change 
from one task to another. Models 6B,C provide more elegant 
solutions to this question by allowing the comparison of target 
and hand to be carried out simultaneously in multiple reference 
frames. The same rules that determine which sensory inputs will 
dominate in any given situation (maximization of likelihood) also 
determine the weight given to the comparison carried out in each 
of the component reference frames. The computational scheme 
depicted in Figure 6B combines features of both the convergent 
model of Figure 1A and the concurrent model of Figure 1B. 
Whereas multiple comparisons of target and hand are performed 
in different reference frames, one can see nevertheless that there 
is a convergence of multimodal sensory signals about the target 
and about the hand before these two quantities are compared 
(subtracted) within each reference frame. In contrast, Model 6C 
combines the results of binomial comparisons of a single sensory 
input about the target (direct or reconstructed in another ref- 
erence frame) with a single sensory input about the hand (also 
direct or reconstructed). Model 6C is the least convergent of the 
three and as such, lends itself to a modular approach to sensory 
integration for the coordination of eye and hand. 

7.1. MODEL PREDICTIONS 

Which of the three models depicted in Figure 6 best represents 
human sensorimotor behavior and the underlying neurophysiol- 
ogy? The three computational structures that we have compared 
here can be distinguished on theoretical grounds and the dif- 
ferences between them lead to testable hypotheses, both at the 
behavioral level and in terms of the neural implementation as 
measured by electrophysiological or other methods. 

7.1.1. Fully convergent vs. concurrent 

The question as to whether sensory signals are combined in 
a unique reference frame that is defined a priori (i.e., in line 
with Figure 6A) prior to performing the comparison between 
hand and target has received considerable attention in recent 
years and can, perhaps, already be rejected. From a Bayesian 
perspective, it can be argued that it is advantageous to 
maintain multiple representations of movement parameters, expressed 
in diverse reference frames, in order to optimize motor 
performance. Electrophysiological evidence also supports the notion 
that motor planning and execution are carried out in 
multiple reference frames in parallel, both across different regions 
of the brain and within a single cortical area (Buneo et al., 
2002; Beurze et al., 2010; Buchholz et al., 2013; Maule et al., 
2013; Reichenbach et al., 2014). At the behavioral level, the 
fully convergent model depicted in Figure 6A cannot predict 
certain experimentally observed characteristics of movement 
planning and execution. As explained in the earliest sections of 
this article (2-2.1), such a computational model cannot explain 
why sensory information about the hand is weighted differently 
between K-VK and V-VK tasks, nor would Model 6A be able 
to predict why the CNS would reconstruct a visual 
representation of a kinesthetic pointing task when the task is bilateral, 
but not when it is unilateral (Tagliabue and McIntyre, 2013). 
Moreover, the combination of parallel comparisons in a vari- 
ety of coordinate systems gives meaning to the concept of a 
hybrid reference frame (Carrozzo and Lacquaniti, 1994). Rather 
than considering that the task is executed in some abstract ref- 
erence frame that has little or no physical meaning, one can 
instead understand that the characteristics of a so-called hybrid 
reference frame may in fact be the manifestation of a paral- 
lel, weighted combination of individual target-hand comparisons 
carried out in reference frames tied to identifiable objects or 
sensors. 



Studies that have explicitly considered sensor fusion in the case 
of reaching or pointing tasks have often assumed, implicitly or 
explicitly, the fully convergent computational structure depicted 
in Figure 1A. One such example is the work carried out by van 
Beers et al. (1996, 1999) who postulated that a minimization of 
motor variability could be the driving factor behind the choice 
of one motor plan over another. They explicitly refer to a con- 
vergent maximum likelihood model structure along the lines of 
Equation (1). The work by Smeets et al. (2006) included the 
assumption that the CNS maintains both a visual and a propri- 
oceptive representation of the hand and of the target, but did not 
include any explicit consideration of the transformation of visual 
information into proprioceptive space or vice versa. Furthermore, 
the equations that the authors used to make the model pre- 
dictions in that study would appear to adhere to the computa- 
tional structure evoked by the convergent model described by 
Equation (1). Nevertheless, the structure of concurrent compar- 
isons described by Equation (2) can also accommodate both of 
these studies, without contradiction. Thus, even though Equation 
(1) has been used on occasion to explain the results of a number 
of studies, the ability of Equation (2) to explain those studies, 
and to also explain the effects of target modality that cannot 
be explained by Equation (1) means that Equation (2) pro- 
vides a more parsimonious explanation of human sensorimotor 
behavior. 

7.1.2. Hybrid concurrent/convergent vs. fully concurrent 

Experiments testing the two concurrent hypotheses 
(Figures 6B,C) have been performed by various groups and 
reported in the literature. We believe that the hybrid formulation 
of Equation (31) is representative of the model proposed by 
McGuire and Sabes (2009). These authors used a more sophisti- 
cated Bayesian analysis to formulate their hypothesis, but as they 
point out, the convolutions required to represent a coordinate 
transformation in Bayesian notation are simply additions or 
subtractions and if there is no prior to be taken into account, the 
posterior is proportional to the likelihood. This model has been 
used to interpret a number of empirical results (McGuire and 
Sabes, 2009, 2011; Burns and Blohm, 2010). In our own studies 
and publications, we have implicitly used the computational 
structure of Equation (32) to interpret the results of a series 
of experiments on multisensory integration (Tagliabue and 
McIntyre, 2008, 2011, 2012, 2013; Tagliabue et al., 2013). But 
whereas both models have been used with success to explain a 
wide range of empirical results, the differentiation between the 
hybrid concurrent/convergent formulation of Figure 6B and 
the fully concurrent formulation in Figure 6C has not, to our 
knowledge, been explicitly taken up in the literature. Yet it should 
be possible to distinguish between the two mechanisms, both in 
terms of potential theoretical advantages of one computational 
scheme over the other and in terms of empirical results, as we 
will discuss below. 

One key difference between Figures 6B,C is that of when the 
difference between target and hand is actually computed. In a lin- 
ear system, this distinction is not very important, since Model 
6B can be rearranged algebraically to match Model 6C, and vice 
versa. But evidence suggests that the combination of sensory 
signals occurs in a non-linear fashion, in part as a means to deal 
with sensory signals that may or may not come from the same 
stimulus or event (Roach et al., 2006; Knill, 2007; Hospedales 
and Vijayakumar, 2009). If sensory signals are separated in 
distance or in time, the Bayesian optimum may be to rely fully on 
one signal or the other, rather than on a weighted sum of the two. 
A corollary of these non-linear processes is that as two 
redundant signals become more separated, the combined estimate may 
become noisier (Wallace et al., 2004). Model 6C has an 
advantage over 6B in this respect. By combining sensory signals only 
after computing the movement vector, disparity between refer- 
ence frames will drop out, provided that the disparity is the same 
for the target and for the hand. One might therefore test this 
hypothesis by artificially modulating the disparity between refer- 
ence frames. The prediction of Model 6C is that such an operation 
will not affect motor precision. 
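
The cancellation of a common disparity can be illustrated with a one-dimensional sketch. The function name `movement_vector_6c`, the positions and the weights are hypothetical values chosen for the example. The sketch covers only the case stated above, in which the disparity between reference frames is the same for the target and for the hand, and it deliberately ignores sensory noise.

```python
def movement_vector_6c(target_v, hand_v, target_k, hand_k, w_v, w_k):
    """Weighted sum of per-frame target-hand differences (Model 6C style).

    The difference is computed within each reference frame first;
    only the resulting movement vectors are combined.
    """
    delta_v = target_v - hand_v   # comparison in the visual frame
    delta_k = target_k - hand_k   # comparison in the kinesthetic frame
    return w_v * delta_v + w_k * delta_k

# Hypothetical 1-D positions and weights, for illustration only.
target, hand = 10.0, 4.0
w_v, w_k = 0.7, 0.3

results = []
for disparity in (0.0, 1.5, -3.0):
    # The same offset shifts both target and hand in the kinesthetic frame.
    results.append(
        movement_vector_6c(target, hand,
                           target + disparity, hand + disparity,
                           w_v, w_k)
    )
print(results)
```

Because the offset is subtracted out within the kinesthetic frame before the movement vectors are combined, the three results are identical, whatever the size of the imposed disparity.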

The question of how the CNS takes into account covariance 
between signals could also provide the basis for favoring one 
model over the other. In Model 6B, visual and 
kinesthetic information about the target are combined by using 
a "local" optimality criterion, that is, by taking into account the 
variability of the signals to be combined (including the necessary 
cross-modal transformations), but neglecting how the resulting 
optimal estimate will be used in later stages. In particular, this 
local optimal weighting of the target information neglects the 
consequences of any covariance that may be generated between 
the two concurrent comparisons $\Delta_V$ and $\Delta_K$. The very same 
considerations are valid, of course, for the hand information. It 
follows that the brain could tend to over-estimate the benefit of 
weighting a given signal, because, although it would "locally" 
provide a more precise estimation of the target and of the hand 
positions, "globally" it would increase the covariance between 
$\Delta_V$ and $\Delta_K$ and, if not corrected, would increase the variance of 
the final output. In other words, generating optimal estimates 
of target and hand does not necessarily lead to optimal targeted 
hand movements. Model 6C, on the other hand, is based on the 
combination of pairwise comparisons of target and hand, with 
maximum likelihood being applied to minimize the variability 
of the combination of multiple movement vectors. Through this 
more modular approach, it is potentially easier to identify and 
adjust for co-variation between movement vectors. 
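
The cost of neglecting covariance can be quantified with a short numerical sketch. The covariance matrix below is a hypothetical example, not an estimate from data, and the function names are introduced only for this illustration. "Local" weights are computed from the variances alone, whereas "global" weights use the full covariance matrix of the two concurrent comparisons.

```python
import numpy as np

def combined_variance(weights, cov):
    """Variance of a weighted sum of correlated estimates: w^T C w."""
    w = np.asarray(weights, dtype=float)
    return float(w @ np.asarray(cov, dtype=float) @ w)

def local_weights(cov):
    """Inverse-variance weights that ignore off-diagonal covariance."""
    confidence = 1.0 / np.diag(np.asarray(cov, dtype=float))
    return confidence / confidence.sum()

def global_weights(cov):
    """Minimum-variance weights that account for covariance."""
    cov = np.asarray(cov, dtype=float)
    raw = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return raw / raw.sum()

# Hypothetical covariance of the two comparisons (visual, kinesthetic).
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])

var_local = combined_variance(local_weights(cov), cov)
var_global = combined_variance(global_weights(cov), cov)
print(var_local, var_global)
```

With these illustrative numbers the weighting that ignores covariance yields a larger output variance than the covariance-aware weighting, which is the sense in which locally optimal estimates need not produce globally optimal movements.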

An example of this is shown in Figure 8, in the case of a V-VK 
task. The hybrid model predicts that both visual and kinesthetic 
information about the hand will be used to construct representa- 
tions of the hand in each of the two reference frames (Figure 8A). 
Due to the inter-modal transformations, the comparison carried 
out in kinesthetic space will be correlated with the comparison 
carried out in visual space. The optimal combination of $\Delta_V$ and 
$\Delta_K$ will need to be modified to take into account the resulting 
co-variation. Model 6C applied in this situation instead predicts 
that the comparison of the visual target position, reconstructed in 
kinesthetic space, with the representation of the hand, 
reconstructed from visual information, will simply drop out, due to 
the co-variance with the direct comparison of target and hand in 
visual space (Figure 8B). One might therefore ask the question, 
will the CNS, like Penelope waiting for Ulysses with her weaving 
(Homer, VIII century BC), perform cross-modal reconstructions, 
only to undo their effects at a later stage (Figure 8A)? Or, by 
maintaining a more modular approach, can the CNS more 
efficiently achieve the optimal solution by performing only those 
transformations and comparisons that are beneficial in any given 
situation (Figure 8B)? 

Of course the ultimate test of the hypotheses presented here 
would be to find correlates of models 6B or 6C in electrophysio- 
logical studies of neuronal activity. Model 6B predicts that one 
should find neurons that respond to multiple sensory inputs 
about the target and similar neurons encoding information about 
the hand. Model 6C makes a novel prediction that certain cells 
will be sensitive to inputs about the target in one (and only 
one) sensory modality but that the spatial information will be 
expressed in the coordinate frame of another. For example, Model 
6C predicts the existence of a cell that encodes the movement 
vector in visual space, even though the cell may be sensitive to 
modulation of proprioceptive, but not visual, signals. This would 
not be the case for Model 6B, where sensory signals from each 
available sensory modality are expected to converge prior to the 
computation of the movement vector. 

8. CONCLUSIONS 

In this article we have formulated computational models that 
rely on multiple concurrent computations carried out in multi- 
ple reference frames in order to optimally drive the hand to a 
target. We have compared these concurrent models to the more 
conventional viewpoint that presupposes the use of a single, 
common reference frame for combining multisensory 
information. The concurrent models are attractive because of their 
modular structure and because they better explain a variety of 
empirical studies. Moreover, they place the question of how to 
combine sensory information and how to choose the reference 
frame(s) for any given task into a common theoretical 
framework, that of maximum likelihood estimation. They also make 
specific, testable predictions about the sensory transformations 
that are performed and the representations of target and hand 
that are maintained in working memory during the performance 
of sensorimotor tasks. In the spirit of this special issue on 
modularity in motor control, we therefore propose that the CNS 
performs multisensory integration in a highly modular fashion, 
building up the required motor commands for targeted 
movements from a principled combination of elementary target-hand 
comparisons. 

FIGURE 8 | Hybrid convergent/concurrent versus fully concurrent. 
Information flow predicted by the hybrid convergent/concurrent (A) and 
fully concurrent (B) models for a V-VK condition in which the target can be 
sensed only visually, but the subject has both visual and kinesthetic 
information about the hand. Missing sources of information are 
represented by faded colors. Dashed lines represent sensory 
transformations and comparisons that can be neglected without a decrease 
in motor performance, given the extent to which the noise in these 
calculations correlates with the other comparisons. The fully concurrent 
model, but not the hybrid model, predicts that in the V-VK condition the 
reconstruction of the kinesthetic representation of the hand from visual 
feedback can be avoided. 